Abstract

We consider a linear regression problem in a high dimensional setting where the number
of covariates $p$ can be much larger than the sample size $n$. In such a situation, one
often assumes sparsity of the regression vector, i.e., the regression vector contains many
zero components. We propose a Lasso-type estimator $\hat\beta^{Quad}$ (where 'Quad' stands for
quadratic) which is based on two penalty terms. The first one is the $\ell_1$ norm of the
regression coefficients, used to exploit the sparsity of the regression as done by the Lasso
estimator, whereas the second is a quadratic penalty term introduced to capture some
additional information on the setting of the problem. We detail two special cases: the
Elastic-Net $\hat\beta^{EN}$ introduced in [42], which deals with sparse problems where correlations
between variables may exist; and the Smooth-Lasso¹ $\hat\beta^{SL}$, which responds to sparse
problems where successive regression coefficients are known to vary slowly (in some situations,
this can also be interpreted in terms of correlations between successive variables). From a
theoretical point of view, we establish variable selection consistency results and show that
$\hat\beta^{Quad}$ achieves a Sparsity Inequality, i.e., a bound in terms of the number of non-zero
components of the 'true' regression vector. These results are provided under a weaker
assumption on the Gram matrix than the one used by the Lasso. In some situations this
guarantees a significant improvement over the Lasso. Furthermore, a simulation study is
conducted and shows that the S-Lasso $\hat\beta^{SL}$ performs better than known methods such as the
Lasso, the Elastic-Net and the Fused-Lasso (introduced in [30]) with respect to the
estimation accuracy. This is especially the case when the regression vector is 'smooth',
i.e., when the variations between successive coefficients of the unknown parameter of the
regression are small. The study also reveals that the theoretical calibration of the tuning
parameters and the one based on 10-fold cross-validation yield two S-Lasso solutions with
close performance.

Keywords: Lasso, Elastic-Net, LARS, Sparsity, Variable selection, Restricted eigenvalues,
High-dimensional data.
1 Introduction

We focus on the usual linear regression model

$$y_i = x_i \beta^* + \varepsilon_i, \qquad i = 1, \dots, n, \qquad (1)$$

where the design $x_i = (x_{i,1}, \dots, x_{i,p}) \in \mathbb{R}^p$ is deterministic,
$\beta^* = (\beta_1^*, \dots, \beta_p^*)' \in \mathbb{R}^p$ is the unknown parameter, and
$\varepsilon_1, \dots, \varepsilon_n$ are independent, identically distributed (i.i.d.) centered
Gaussian random variables with known variance $\sigma^2$. We aim at estimating $\beta^*$ in the sparse
case, that is, when many of its unknown components are zero. Thus only a subset of the design
covariates $(X_j)_j$ is truly of interest, where $X_j = (x_{1,j}, \dots, x_{n,j})'$, $j = 1, \dots, p$.
Moreover, we are interested in the high dimensional problem where $p \gg n$ and we consider $p$
depending on $n$. In such a framework, two main problems arise: the interpretability of the
prediction and the control of the variance in the estimation. To tackle these problems we use
regularized selection type procedures of the form

$$\hat\beta = \arg\min_{\beta \in \mathbb{R}^p} \left\{ \|Y - X\beta\|_n^2 + \mathrm{pen}(\beta) \right\}, \qquad (2)$$

where $X = (x_1', \dots, x_n')'$, $Y = (y_1, \dots, y_n)'$ and $\mathrm{pen} : \mathbb{R}^p \to \mathbb{R}$ is a positive
convex function called the penalty. For any vector $a = (a_1, \dots, a_n)'$, we have adopted the
notation $\|a\|_n^2 = n^{-1} \sum_{i=1}^n |a_i|^2$ and we denote by $\langle \cdot, \cdot \rangle_n$ the corresponding
inner product in $\mathbb{R}^n$. The choice of the penalty appears to be crucial. On the one hand,
although well-suited for variable selection purposes, concave-type penalties (see for example
[13, 32]) are often computationally hard to optimize. On the other hand, Lasso-type procedures
(modifications of the $\ell_1$ penalized least squares (Lasso) estimator introduced in [29]) have
been extensively studied during the last few years. See for example [3, 7, 30] and references
therein. Such procedures are suitable for our purposes as they perform both regression
parameter estimation and variable selection with low computational costs. We will explore
this type of procedure in our study.

¹ The Smooth-Lasso estimator was initially introduced in the paper titled Regularization with the
Smooth-Lasso procedure, in [14]. Results for this method that are not provided here can be found there,
such as the theoretical performance when $p < n$ and a simulation study from a variable selection point of view.

In this paper, we propose a novel estimator, denoted by $\hat\beta^{Quad}$, which is a modification of
the Lasso. It is defined as the solution of the optimization problem (2) for a combination of
the Lasso penalty (i.e., $\sum_{j=1}^p |\beta_j|$) and a quadratic penalty $\beta' J' J \beta$ for some
$m \times p$ matrix $J$ ($m \in \mathbb{N}^*$).

The matrix $J$ typically reflects some underlying geometry or structure in the true signal.
More generally, the matrix $J$ can be chosen so that sparsity of $\beta^*$ translates to some other
desired behavior depending on the context. There is a wide variety of interesting applications,
and what we present below is not meant to be an exhaustive list but rather a small set of
illustrative examples that motivated our work on this problem. We add this second term
to the Lasso procedure for two main reasons. First, we exploit this second penalty to take
into account some prior information on the data or the regression vector (such as correlation
between variables or a specified structure on the regression vector). Second, the quadratic
penalty is introduced to overcome (or to reduce) theoretical problems observed with the Lasso
estimator. Indeed (see for example [3, 8, 37, 40, 41]), strong conditions are required to
guarantee good performance in prediction, estimation or variable selection for the Lasso
procedure. See also [33] for an overview of the conditions used to establish the theoretical
results for the Lasso. It was shown that the Lasso does not always ensure good
performance when high correlations exist between the covariates. In this paper, we establish
theoretical results showing good performance of $\hat\beta^{Quad}$ under a weaker assumption than the
one needed for the Lasso estimator. The improvement is especially observed when the Lasso
achieves only poor results.

Two particular cases of the estimator $\hat\beta^{Quad}$ are mainly considered. The first is the
Elastic-Net, introduced in [42] to deal with problems where correlations between variables exist.
It is defined with the quadratic penalty term $\sum_{j=1}^p \beta_j^2$. The second and novel procedure
is called the Smooth-Lasso (S-Lasso) estimator. It is defined with the $\ell_2$-fusion penalty, that
is, $\sum_{j=2}^p (\beta_j - \beta_{j-1})^2$. The $\ell_2$-fusion penalty was first introduced in [17]. This
term helps to tackle situations where the regression vector is structured such that its
coefficients vary slowly. Let us call the regression vector 'smooth' in this case. Note, however,
that our theoretical study covers a large class of procedures, such as the closely related
'Weighted Fusion' introduced in [10]. This is detailed in Remark 1.

The main contribution of this paper is the introduction of the Smooth-Lasso estimator
which significantly improves (both in theory and in practice) the performance of the Lasso
and the Elastic-Net in some situations. However, the method is a special case of the estimator
$\hat\beta^{Quad}$. This type of estimator aims at

• capturing the sparsity and some other structure (smoothness in the case of the S-Lasso) ; 

• reducing the assumptions on the Gram matrix and providing theoretical guarantees in 
situations that are not suitable for the Lasso (correlations between successive covariates 
in the case of the S-Lasso). 

From a practical point of view, some problems are also encountered when we solve the
Lasso criterion (for instance with the LARS algorithm [12]). Indeed, this algorithm fails
to select a complete group of correlated covariates. We describe two disadvantages of the
Lasso. First, the Lasso is consistent neither in variable selection nor in estimation (bad
reconstitution of $\beta^*$). In this paper, we focus on the estimation issue. We consider the case
where the regression vector $\beta^*$ is structured. We invoke the S-Lasso estimator to respond
to such problems where the covariates are ranked so that the regression vector is 'smooth'
(that is, the vector $\beta^*$ has only small variations in its successive components). We will see
with the help of simulations that such situations support the use of the S-Lasso estimator.
This estimator is inspired by the Fused-Lasso [30]. Both S-Lasso and Fused-Lasso combine an
$\ell_1$-penalty with a fusion term [17]. The fusion term is designed to make successive coefficients
as close as possible to each other. The main difference between these two procedures is
that we use the $\ell_2$ distance between the successive coefficients (that is, the $\ell_2$-fusion penalty
$\sum_{j=2}^p (\beta_j - \beta_{j-1})^2$) whereas the Fused-Lasso uses the $\ell_1$ distance (that is, the
$\ell_1$-fusion penalty $\sum_{j=2}^p |\beta_j - \beta_{j-1}|$). Hence, compared to the Fused-Lasso, we sacrifice
sparsity in changes between successive coefficients in the estimation of $\beta^*$ for an easier
optimization due to the strict convexity of the $\ell_2$ distance. This implies a large reduction of
computational cost. Sparsity is nonetheless ensured by the Lasso penalty. The $\ell_2$-fusion penalty
helps to provide 'smooth' solutions. Consequently, even if there is no perfect match between
successive coefficients, our results are still interpretable. From a theoretical point of view, the
$\ell_2$ distance also helps us to provide theoretical properties for the S-Lasso which in some
situations appears to outperform the Lasso and the Elastic-Net (cf. [42]). Let us mention that
variable selection consistency of the Fused-Lasso and the corresponding Fused adaptive Lasso
have also been studied in [27] but in a different context from the one in the present paper. The
results obtained in [27] are established not only under the sparsity assumption, but the model
is also supposed to be piecewise constant, that is, the non-zero coefficients are represented in
a block shape with equal values inside each block.
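The difference between the two fusion terms can be seen on a toy example. The sketch below is only an illustration: the helper names `l2_fusion` and `l1_fusion` and the two test vectors are our own assumptions and do not come from the paper. It evaluates both penalties on a slowly varying ramp and on a single jump of the same total height.

```python
import numpy as np

def l2_fusion(beta):
    """l2-fusion term: sum over j >= 2 of (beta_j - beta_{j-1})^2."""
    d = np.diff(np.asarray(beta, dtype=float))
    return float(np.sum(d ** 2))

def l1_fusion(beta):
    """l1-fusion term (Fused-Lasso): sum over j >= 2 of |beta_j - beta_{j-1}|."""
    d = np.diff(np.asarray(beta, dtype=float))
    return float(np.sum(np.abs(d)))

# Hypothetical vectors: same total variation, different shapes.
ramp = np.array([0.0, 0.25, 0.5, 0.75, 1.0])  # four small increments
step = np.array([0.0, 0.0, 0.0, 1.0, 1.0])    # one large increment

for name, v in [("ramp", ramp), ("step", step)]:
    print(name, "l2-fusion:", l2_fusion(v), "l1-fusion:", l1_fusion(v))
```

On these two vectors the $\ell_1$-fusion term takes the same value, whereas the $\ell_2$-fusion term is four times smaller on the ramp. This is one way to read the statement that the $\ell_2$-fusion penalty favours slowly varying coefficients rather than exact equality between neighbours.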

Many techniques have been proposed to address the weaknesses of the Lasso. The Fused-Lasso
procedure is one of them. Additionally we give here some of the most popular alternative
methods. The Adaptive Lasso was introduced in [41]. It is similar to the Lasso but with
adaptive weights used to penalize each regression coefficient separately. Under certain
(strong) conditions, this procedure reaches Oracle Properties (that is, consistency in variable
selection and asymptotic normality, see [41]). Another approach is the Relaxed Lasso (see [20]),
which aims at double-controlling the Lasso estimate: one parameter to control variable selection
and another to control the shrinkage of the selected coefficients. To overcome the problem due to
the correlation between covariates, group variable selection has been proposed in [36] with the
Group-Lasso procedure which selects groups of correlated covariates instead of single covariates
at each step. A first step to the variable selection consistency study has been proposed in [1]
and Sparsity Inequalities were given in [19]. In [42], another choice of penalty has been
proposed with the Elastic-Net. This penalty has also been studied for example in [5, 15, 43].

The rest of the paper is organized as follows. In the next section, we introduce the estimator
$\hat\beta^{Quad}$ defined with the Lasso penalty together with a quadratic penalty. In particular, we
define the S-Lasso estimator and a notion of smoothness. We also provide a way to solve the
$\hat\beta^{Quad}$ problem with the attractive property of piecewise linearity of its regularization path.
Consistency in estimation and variable selection in the high dimensional case are considered
in Section 3. We moreover provide some examples in favor of the Elastic-Net and the S-Lasso
in Sections 3.1.2 and 3.1.3 and technical issues in Section 3.3. We finally give experimental
results which show the S-Lasso performance compared to some popular methods.
All proofs are postponed to the Appendix.



2 The S-Lasso procedure 

In many applications, for example in macroeconomics, financial time series analysis, and
biological and medical sciences, one often deals with data with given complex attributes and a
'smooth' solution. This is, for instance, the case in trend filtering (see [16] for a nice survey).
As a start, let us provide a definition of a 'smooth' vector:

Definition 2.1 (Smoothness). Let $a$ be some positive number. A vector $\beta \in \mathbb{R}^p$ is $a$-smooth
(or simply smooth) if

$$\sum_{j=2}^p (\beta_j - \beta_{j-1})^2 \le a.$$

In the applications mentioned above, the regression vector $\beta^*$ is smooth. Hence, it is
important to consider estimation methods which can reflect this aspect of the problem. It is
often useful to assume that the regression vector is also sparse in order to be able to treat
data such as spectrometry or some genomic data, where both smoothness and sparsity appear
simultaneously. For these reasons, it is worth introducing and analyzing a method which can
reconstitute sparse and smooth regression vectors. Hence, we define the S-Lasso estimator
$\hat\beta^{SL}$ as the solution of the optimization problem (2) with the penalty

$$\mathrm{pen}(\beta) = \lambda |\beta|_1 + \mu \sum_{j=2}^p (\beta_j - \beta_{j-1})^2, \qquad (3)$$

where $\lambda$ and $\mu$ are two positive parameters that control the sparsity of our estimator and
its smoothness. For any vector $a = (a_1, \dots, a_p)'$ and integer $q$, we have used the notation
$|a|_q^q = \sum_{j=1}^p |a_j|^q$. Note that the Lasso estimator is a special case of the S-Lasso with $\mu = 0$.
More generally, we consider the following penalty

$$\mathrm{pen}(\beta) = \lambda |\beta|_1 + \mu\, \beta' J' J \beta, \qquad (4)$$

where $J$ is a given $m \times p$ matrix ($m \in \mathbb{N}^*$). This penalty is a combination of the Lasso
penalty and a quadratic penalty. The matrix $J$ typically reflects some underlying geometry or
structure in the true signal (we refer to [31] for similar ideas). Let us call $\hat\beta^{Quad}$ the solution
of the minimization problem (2) and (4). The S-Lasso penalty is a particular case of the
penalty (4) with $J$ given by

$$J = \begin{pmatrix}
0 & 0 & \cdots & \cdots & 0 \\
1 & -1 & \ddots & & \vdots \\
0 & 1 & -1 & \ddots & \vdots \\
\vdots & \ddots & \ddots & \ddots & 0 \\
0 & \cdots & 0 & 1 & -1
\end{pmatrix}. \qquad (5)$$

The Elastic-Net corresponds to the case where $J$ is the identity matrix.
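As a quick numerical check of how the quadratic form $\beta' J' J \beta$ recovers the two special cases, the sketch below builds a first-difference matrix and the identity. It assumes that the matrix in (5) is the $p \times p$ matrix with a zero first row and rows $(\dots, 1, -1, \dots)$ afterwards; the function name `smooth_lasso_J` and the random test vector are ours.

```python
import numpy as np

def smooth_lasso_J(p):
    """Assumed form of (5): p x p, first row zero, row j has 1 at j-1 and -1 at j."""
    J = np.zeros((p, p))
    for j in range(1, p):
        J[j, j - 1] = 1.0
        J[j, j] = -1.0
    return J

rng = np.random.default_rng(0)
p = 6
beta = rng.normal(size=p)  # arbitrary test vector

# S-Lasso case: beta' J' J beta equals the l2-fusion term of penalty (3).
J = smooth_lasso_J(p)
quad_form_sl = beta @ J.T @ J @ beta
fusion = np.sum(np.diff(beta) ** 2)

# Elastic-Net case: J is the identity, so the quadratic form is |beta|_2^2.
I = np.eye(p)
quad_form_en = beta @ I.T @ I @ beta

print(quad_form_sl, fusion)
print(quad_form_en, np.sum(beta ** 2))
```

Any other matrix whose rows are the successive differences (for instance the $(p-1) \times p$ version without the zero row) gives the same quadratic form, so the choice made here only affects the value of $m$.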

Remark 1. Let $s_{jk}$ denote the sign of the sample correlation between predictor variables $j$
and $k$. Denote also by $w_{jk} \ge 0$ some predictor correlation driven weights. Given this notation,
the Weighted Fusion introduced in [10] corresponds to the case where the $k$-th diagonal term of
$J'J$ equals $\sum_{j \ne k} w_{jk}$ and $(J'J)_{kj} = (J'J)_{jk} = -s_{jk} w_{jk}$ for $j \ne k$.

Now we deal with the solution $\hat\beta^{Quad}$ of (2) and (4) and its computational costs. The
following lemma shows that $\hat\beta^{Quad}$ can be expressed as a Lasso solution by expanding the
data artificially.

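Lemma 1. Given the dataset $(X, Y)$ and the tuning parameters $(\lambda, \mu)$, define the extended
dataset $(\tilde X, \tilde Y)$ and $\tilde\varepsilon$ by

$$\tilde X = \begin{pmatrix} X \\ \sqrt{n\mu}\, J \end{pmatrix}, \qquad
\tilde Y = \begin{pmatrix} Y \\ \mathbf{0} \end{pmatrix}, \qquad
\tilde\varepsilon = \begin{pmatrix} \varepsilon \\ -\sqrt{n\mu}\, J\beta^* \end{pmatrix},$$

where $\mathbf{0}$ is a vector of size $m$ containing only zeros, $\varepsilon = (\varepsilon_1, \dots, \varepsilon_n)'$ is the noise
vector and $J$ is the $m \times p$ matrix given by the penalty (4). Then, we have
$\tilde Y = \tilde X \beta^* + \tilde\varepsilon$, and the estimator $\hat\beta^{Quad}$, defined as the solution of the
minimization problem (2) with the penalty given by (4), is also the minimizer of the Lasso-criterion

$$\frac{1}{n} \left| \tilde Y - \tilde X \beta \right|_2^2 + \lambda |\beta|_1. \qquad (6)$$

This result is a consequence of simple algebra. It motivates the following comments on the
estimator $\hat\beta^{Quad}$.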
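The algebra behind Lemma 1 can be checked numerically. The sketch below is an illustration under stated assumptions: simulated Gaussian data, a first-difference matrix $J$ with a zero first row, and arbitrary values of $\lambda$ and $\mu$ (none of these come from the paper). It verifies that the criterion (2) with penalty (4) and the Lasso criterion on the extended data take the same value at an arbitrary $\beta$, and that the extended model identity $\tilde Y = \tilde X \beta^* + \tilde\varepsilon$ holds.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 20, 8
lam, mu = 0.3, 0.7  # arbitrary illustrative tuning parameters

# Simulated data from model (1) with a sparse beta_star (assumption for the demo).
X = rng.normal(size=(n, p))
beta_star = np.concatenate([np.ones(3), np.zeros(p - 3)])
eps = rng.normal(size=n)
Y = X @ beta_star + eps

# First-difference matrix with a zero first row (assumed form of J).
J = np.zeros((p, p))
for j in range(1, p):
    J[j, j - 1], J[j, j] = 1.0, -1.0
m = J.shape[0]

# Extended dataset as in Lemma 1.
X_ext = np.vstack([X, np.sqrt(n * mu) * J])
Y_ext = np.concatenate([Y, np.zeros(m)])
eps_ext = np.concatenate([eps, -np.sqrt(n * mu) * (J @ beta_star)])

def quad_criterion(beta):
    """(1/n)|Y - X beta|_2^2 + lam*|beta|_1 + mu * beta' J'J beta."""
    fit = np.sum((Y - X @ beta) ** 2) / n
    return fit + lam * np.sum(np.abs(beta)) + mu * beta @ J.T @ J @ beta

def lasso_criterion_ext(beta):
    """(1/n)|Y_ext - X_ext beta|_2^2 + lam*|beta|_1, i.e. criterion (6)."""
    return np.sum((Y_ext - X_ext @ beta) ** 2) / n + lam * np.sum(np.abs(beta))

beta_test = rng.normal(size=p)
print(quad_criterion(beta_test), lasso_criterion_ext(beta_test))
```

Since the two criteria coincide for every $\beta$, they share the same minimizer, which is the content of the lemma.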



Remark 2 (Regularization paths). LARS is an iterative algorithm introduced in [12]. A
modification of LARS can be used to construct $\hat\beta^{Quad}$. For a fixed $\mu$, it constructs at each step
an estimator based on the correlation between covariates and the current residual. Each step
corresponds to a value of $\lambda$. Then, for a fixed $\mu$, we obtain the evolution of the coefficient
values of $\hat\beta^{Quad}$ when $\lambda$ varies. This evolution describes the regularization paths of $\hat\beta^{Quad}$, which
are piecewise linear (see [28]). This property implies that (again for fixed $\mu$) the problem (2)
and (4) can be solved using the LARS algorithm with the same computational cost as the
ordinary least squares (OLS) estimate.
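To make the reduction to a Lasso problem concrete, here is a minimal sketch that solves the criterion on the extended data for one fixed pair $(\lambda, \mu)$. Note the substitution: it uses plain cyclic coordinate descent instead of the LARS modification discussed in the remark, so it returns a single solution rather than the whole regularization path. The function names, the simulated data and the tuning values are assumptions made for illustration.

```python
import numpy as np

def soft_threshold(z, t):
    """Scalar soft-thresholding operator."""
    return np.sign(z) * max(abs(z) - t, 0.0)

def lasso_cd(X, Y, lam, n_norm, n_iter=500):
    """Minimise (1/n_norm)*|Y - X b|_2^2 + lam*|b|_1 by cyclic coordinate descent."""
    p = X.shape[1]
    b = np.zeros(p)
    r = Y.astype(float).copy()                 # current residual Y - X b
    col_sq = np.sum(X ** 2, axis=0) / n_norm
    for _ in range(n_iter):
        for j in range(p):
            if col_sq[j] == 0.0:
                continue
            r += X[:, j] * b[j]                # remove coordinate j from the fit
            rho = X[:, j] @ r / n_norm
            b[j] = soft_threshold(rho, lam / 2.0) / col_sq[j]
            r -= X[:, j] * b[j]                # put the updated coordinate back
    return b

def quad_estimator(X, Y, J, lam, mu):
    """Solve (2) with penalty (4) as a Lasso on the extended data of Lemma 1."""
    n = X.shape[0]
    X_ext = np.vstack([X, np.sqrt(n * mu) * J])
    Y_ext = np.concatenate([Y, np.zeros(J.shape[0])])
    return lasso_cd(X_ext, Y_ext, lam, n_norm=n)

def objective(beta, X, Y, J, lam, mu):
    n = X.shape[0]
    return (np.sum((Y - X @ beta) ** 2) / n + lam * np.sum(np.abs(beta))
            + mu * np.sum((J @ beta) ** 2))

# Illustrative data: smooth and sparse beta_star (assumption for the demo).
rng = np.random.default_rng(2)
n, p = 30, 10
X = rng.normal(size=(n, p))
beta_star = np.array([0, 0, 1, 1.2, 1.4, 1.2, 1, 0, 0, 0], dtype=float)
Y = X @ beta_star + 0.5 * rng.normal(size=n)

J = np.zeros((p, p))
for j in range(1, p):
    J[j, j - 1], J[j, j] = 1.0, -1.0

lam, mu = 0.2, 0.5
b_hat = quad_estimator(X, Y, J, lam, mu)
print(np.round(b_hat, 3))
```

The factor `lam / 2.0` in the update comes from the normalisation $(1/n)|\cdot|_2^2$ used in (6), which has no factor $1/2$ in front of the squared loss.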
3 Theoretical results in the high dimensional setting



In this section, we study the performance of the estimator $\hat\beta^{Quad}$ in the high dimensional case.
In particular, we provide a non-asymptotic bound for the squared risk. We also provide a
bound for the $\ell_2$ estimation error of $\hat\beta^{Quad}$. Let

$$\tilde J = J'J$$

be the $p \times p$ matrix where $J$ is the matrix appearing in the quadratic penalty (4). Since our
main interest is the study of the S-Lasso estimator, we first focus on the case where the matrix
$\tilde J$ is sparse. We refer the reader to Section 3.3, where we address several technical points, for
example the study of the case where the matrix $\tilde J$ is general.

All the results of this section are proved in Section 6. These theoretical contributions rely
partly on Lemma 1. Let us finally mention that the tuning parameters $\lambda$ and $\mu$ will actually be
chosen depending on the sample size $n$. We emphasize this dependency by adding a subscript
$n$ to these parameters.

3.1 Sparsity Inequality when J is sparse 

Now we establish a Sparsity Inequality (SI) achieved by the estimator $\hat\beta^{Quad}$, that is, a bound
on the squared risk that takes into account the sparsity of the regression vector $\beta^*$. More
precisely, we prove that the rate of convergence of $\hat\beta^{Quad}$ is
$\max\left(|\mathcal{A}^*| \log(n)/n \,;\, \mu_n^2 |\tilde J\beta^*|_2^2\right)$,
where $\mathcal{A}^*$ is the sparsity set $\mathcal{A}^* = \{j : \beta_j^* \ne 0\}$. This rate depends not only on the sparsity
index $|\mathcal{A}^*|$ but also on $|\tilde J\beta^*|_2$. In the case of the S-Lasso, this last quantity is related to the
smoothness of the vector $\beta^*$. Let us first present the assumptions needed, and the setup of
this contribution. Let $\eta \in (0, 1)$ be given ($1 - \eta$ will be a confidence level, see Theorem 1).
We define the tuning parameter $\lambda_n$ as

$$\lambda_n = 4\sqrt{2}\, \sigma \sqrt{\frac{\log(p/\eta)}{n}}. \qquad (7)$$

For now, we leave the calibration of $\mu_n$ free. We discuss later (see Corollary 1 and Section 3.1.1
for example) the choice for this parameter. Our assumption on the Gram matrix
$\Psi^n := n^{-1} X'X$ involves the symmetric $p \times p$ matrix $K_n$ defined by

$$K_n = \Psi^n + \mu_n \tilde J. \qquad (8)$$

Given the expanded dataset defined in Lemma 1, we note that $K_n = n^{-1} \tilde X' \tilde X$ can be seen as
an expanded Gram matrix. Let $\Theta \subset \{1, \dots, p\}$ be a set of indices. Using this notation, we
formulate the following assumption:

Assumption $B(\Theta)$: Let $K_n$ be the matrix given by (8) and let
$\theta_n = 4\sqrt{|\mathcal{A}^*|} + \frac{4\mu_n}{\lambda_n} |\tilde J\beta^*|_2$.
There is a constant $\phi_{\mu_n} > 0$ such that, for any $\Delta \in \mathbb{R}^p$ that satisfies
$\sum_{j \notin \Theta} |\Delta_j| \le \theta_n \sqrt{\sum_{j \in \Theta} \Delta_j^2}$, we have

$$\Delta' K_n \Delta \ge \phi_{\mu_n} \sum_{j \in \Theta} \Delta_j^2. \qquad (9)$$

Here are some comments about this assumption:

• first of all, Assumption $B(\Theta)$ is inspired by the Restricted Eigenvalue (RE) Assumption
introduced in [3]. The RE assumption is widely used in the literature and requires
somehow that the restriction of the matrix $K_n$ to the rows and columns in $\Theta$ is invertible
(when $K_n$ is invertible, the condition (9) is always satisfied with $\phi_{\mu_n}$ at least as large as
the smallest eigenvalue of $K_n$). We refer to [33] for more details on this assumption.
The main difference with the assumption we use is that in [3] the authors consider the
case where $K_n = \Psi^n$, which matches with the Lasso estimator (that is, $\mu_n = 0$ in our
setting). In the sequel, let $\phi_0$ denote $\phi_{\mu_n}$ for $\mu_n = 0$, that is, the case of the Lasso estimator;

• another difference to [3] is that the set on which the assumption should hold is larger
in Assumption $B(\Theta)$ than in the RE Assumption. Indeed, in Assumption $B(\Theta)$, the
considered vectors $\Delta$ should be such that
$\sum_{j \notin \Theta} |\Delta_j| \le \theta_n \sqrt{\sum_{j \in \Theta} \Delta_j^2}$, whereas in [3]
the authors only need to consider vectors $\Delta$ such that
$\sum_{j \notin \Theta} |\Delta_j| \le \mathrm{cst} \cdot \sum_{j \in \Theta} |\Delta_j|$ (see
also [33]). We make this set larger to allow large values of the tuning parameter $\mu_n$. We
will explain later why this is desirable;

• in the case of the Elastic-Net, $\Theta = \mathcal{A}^*$ in Assumption $B(\Theta)$. Hence, the assumption
above is close to Condition Stabil in [5, page 4] for the Elastic-Net. We will consider
precisely the difference between both assumptions in Section 3.1.2. However, let us
mention here that in Condition Stabil the condition (9) is replaced by
$\Delta' \Psi^n \Delta \ge (\phi_{\mu_n} - \mu_n) |\Delta_{\mathcal{A}^*}|_2^2$ for a constant $\phi_{\mu_n} > \mu_n$;

• only small subsets $\Theta$ of indices will be considered in Assumption $B(\Theta)$. More precisely,
let $\mathcal{B} \subset \{1, \dots, p\}$ be a set of indices including the true sparsity set $\mathcal{A}^*$. We will
consider a set depending on $\tilde J$ and on $\mathcal{A}^*$, and the sparser $\tilde J$, the smaller $\mathcal{B}$. For instance,
in the case of the Elastic-Net, $\mathcal{B} = \mathcal{A}^*$, and in the case of the S-Lasso (that we will detail
later), the set $\mathcal{B}$ is such that $|\mathcal{B}| \le 3|\mathcal{A}^*|$. Thanks to the sparsity of $\tilde J$, we will see that we
can assume that there exists a constant $c_J \ge 1$ such that $|\mathcal{B}| \le c_J |\mathcal{A}^*|$ (see Sections 3.1.2
and 3.1.3).
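The role of the expanded Gram matrix can be illustrated numerically in the Elastic-Net case. The sketch below is an illustration on simulated data with $p > n$ (the dimensions and the value of $\mu_n$ are arbitrary assumptions). It checks that $K_n = \Psi^n + \mu_n \tilde J$ coincides with $n^{-1}\tilde X'\tilde X$ and compares the smallest eigenvalues of $\Psi^n$ and $K_n$. The smallest eigenvalue is only a crude lower bound for the constant $\phi_{\mu_n}$ in (9), not the restricted constant itself.

```python
import numpy as np

rng = np.random.default_rng(3)
n, p, mu = 15, 40, 0.5          # p > n, so the Gram matrix is singular
X = rng.normal(size=(n, p))
psi = X.T @ X / n               # Gram matrix Psi^n

J = np.eye(p)                   # Elastic-Net case: J is the identity
J_tilde = J.T @ J
K = psi + mu * J_tilde          # expanded Gram matrix K_n as in (8)

# Same matrix obtained from the extended design of Lemma 1.
X_ext = np.vstack([X, np.sqrt(n * mu) * J])
K_from_ext = X_ext.T @ X_ext / n

# eigvalsh returns eigenvalues in ascending order.
min_eig_psi = np.linalg.eigvalsh(psi)[0]
min_eig_K = np.linalg.eigvalsh(K)[0]

print("smallest eigenvalue of Psi^n:", min_eig_psi)
print("smallest eigenvalue of K_n  :", min_eig_K)
```

In this simulated setting $\Psi^n$ has rank at most $n < p$, so its smallest eigenvalue is numerically zero, whereas that of $K_n$ is at least $\mu_n$. This matches the parenthetical comment that an invertible $K_n$ always satisfies (9).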

Theorem 1 below holds for general matrices $\tilde J$. However we emphasize here the sparse
case since Assumption $B(\mathcal{B})$ with large sets $\mathcal{B}$ is more stringent (with $\phi_{\mu_n}$ close or equal to
zero). Hence in the general case, another assumption presented in Section 3.3.1 may be more
attractive. We also mention that Theorem 1 is formulated as generally as possible. We refer
to Corollary 1 below for a special case illustrating the superiority of $\hat\beta^{Quad}$ compared to the
Lasso.

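Theorem 1 ($\tilde J$ sparse). Let $\mathcal{A}^*$ be the sparsity set. Let the tuning parameter $\lambda_n$ be defined as
in (7). Suppose that Assumption $B(\mathcal{B})$ is satisfied with a set $\mathcal{B} \supseteq \mathcal{A}^*$ such that $|\mathcal{B}| \le c_J |\mathcal{A}^*|$
for a given constant $c_J \ge 1$. Then, with probability greater than $1 - \eta$, we have

$$\left\| X\beta^* - X\hat\beta^{Quad} \right\|_n^2 \le \phi_{\mu_n}^{-1} \left( 2\lambda_n \sqrt{|\mathcal{A}^*|} + 2\mu_n |\tilde J\beta^*|_2 \right)^2, \qquad (10)$$

$$(\hat\beta^{Quad} - \beta^*)' \tilde J (\hat\beta^{Quad} - \beta^*) \le \frac{\phi_{\mu_n}^{-1}}{\mu_n} \left( 2\lambda_n \sqrt{|\mathcal{A}^*|} + 2\mu_n |\tilde J\beta^*|_2 \right)^2, \qquad (11)$$

and

$$\left| \hat\beta^{Quad} - \beta^* \right|_1 \le \frac{2\phi_{\mu_n}^{-1}}{\lambda_n} \left( 2\lambda_n \sqrt{|\mathcal{A}^*|} + 2\mu_n |\tilde J\beta^*|_2 \right)^2.$$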

Theorem 1 states that $\hat\beta^{Quad}$ achieves a SI which also brings the quantity $|\tilde J\beta^*|_2$ into play.
A first glance at the bounds above would suggest that $\mu_n = 0$ provides the best rates. However,
it is worth noting that $\phi_{\mu_n}$, one of the main terms of the bounds, also depends on $\mu_n$ and
increases with this parameter since $\tilde J$ is positive semidefinite. Calibration of $\mu_n$ captures the
tradeoff between slowing down the rate of convergence and being able to address situations
where the Lasso fails. For instance, the Smooth-Lasso with a large $\mu_n$ is devoted to problems
with large correlations between successive variables. In Section 3.1.1 we further discuss the
importance of a good calibration of $\mu_n$ and the interest of using $\hat\beta^{Quad}$ (with $\mu_n$ different from
zero) instead of the Lasso estimator. These considerations lead to the following Corollary 1. It
points out that the estimator $\hat\beta^{Quad}$ is particularly useful when the assumptions on the Gram
matrix are so restrictive that the Lasso error fails to be well controlled.



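Corollary 1. Consider the same setting as in Theorem 1. Let
$\lambda_n = 4\sqrt{2}\, \sigma \sqrt{\log(p/\eta)/n}$ with $\eta \in (0, 1)$ and
$\mu_n = \dfrac{\lambda_n \sqrt{|\mathcal{A}^*|}}{2 |\tilde J\beta^*|_2}$.
Then, $\theta_n = 6\sqrt{|\mathcal{A}^*|}$ in Assumption $B(\mathcal{B})$ and with probability
greater than $1 - \eta$ we have

$$\left\| X\beta^* - X\hat\beta^{Quad} \right\|_n^2 \le \frac{288\, \sigma^2}{\phi_{\mu_n}} \frac{\log(p/\eta)}{n} |\mathcal{A}^*|,$$

and

$$\left| \beta^* - \hat\beta^{Quad} \right|_1 \le \frac{72\sqrt{2}\, \sigma}{\phi_{\mu_n}} \sqrt{\frac{\log(p/\eta)}{n}}\, |\mathcal{A}^*|.$$

Assume furthermore that the Gram matrix $\Psi^n$ is such that $\phi_0 \le \lambda_n^2 |\mathcal{A}^*|$ and that the extended
Gram matrix $K_n$ is such that $\phi_{\mu_n} \ge \mu_n$. Then the bound on the Lasso (obtained by setting
$\mu_n = 0$ above) does not guarantee any control on the errors. In contrast, $\hat\beta^{Quad}$ satisfies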



and if (pn 



> 



XP* - Xj3 



_^QW|^ < 36A/fT|J/3*|2 



Quad 



< 72\/2cj| J/3* 



log{p/r]) 



n 



A*\, 



log(p/??) \ 



1/4 



n 



1 3/4 



with probability greater than $1 - \eta$.



The above bounds are even better when $|\tilde J\beta^*|_2$ is small. One illustration of this corollary
can be found in the example included in Section 3.1.3. Moreover, we refer to Section 3.3
for other choices of $\mu_n$ which are more suitable when we deal with a general (non sparse)
matrix $\tilde J$.

In our simulation study we focus on the particular choice of $\mu_n$ given in the first part of
Corollary 1. However, in real applications, since the parameters $\lambda_n$ and $\mu_n$ depend on the
unknown regression vector $\beta^*$, we tune them with the help of a 2D ten-fold cross-validation
over a grid.
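For reference, the theoretical calibration can be written out in a few lines. The sketch below assumes $\lambda_n = 4\sqrt{2}\,\sigma\sqrt{\log(p/\eta)/n}$ as in (7), $\mu_n = \lambda_n\sqrt{|\mathcal{A}^*|}/(2|\tilde J\beta^*|_2)$ as in Corollary 1, and $\theta_n = 4\sqrt{|\mathcal{A}^*|} + (4\mu_n/\lambda_n)|\tilde J\beta^*|_2$ as in Assumption $B(\Theta)$. The function names and the numerical inputs are ours, and the sparsity index and $|\tilde J\beta^*|_2$ are treated as known, which is only possible in simulations.

```python
import math

def lambda_n(sigma, n, p, eta):
    """Tuning parameter (7): 4*sqrt(2)*sigma*sqrt(log(p/eta)/n)."""
    return 4.0 * math.sqrt(2.0) * sigma * math.sqrt(math.log(p / eta) / n)

def mu_n(lam, sparsity, norm_Jt_beta):
    """Choice of Corollary 1: lam*sqrt(|A*|) / (2*|J~ beta*|_2)."""
    return lam * math.sqrt(sparsity) / (2.0 * norm_Jt_beta)

def theta_n(lam, mu, sparsity, norm_Jt_beta):
    """theta_n = 4*sqrt(|A*|) + (4*mu/lam)*|J~ beta*|_2 from Assumption B."""
    return 4.0 * math.sqrt(sparsity) + 4.0 * mu / lam * norm_Jt_beta

# Hypothetical oracle quantities for illustration only.
sigma, n, p, eta = 1.0, 100, 500, 0.05
sparsity, norm_Jt_beta = 10, 2.0

lam = lambda_n(sigma, n, p, eta)
mu = mu_n(lam, sparsity, norm_Jt_beta)
theta = theta_n(lam, mu, sparsity, norm_Jt_beta)

# Numerator of the right-hand side of (10) with this calibration.
rhs_numerator = (2.0 * lam * math.sqrt(sparsity) + 2.0 * mu * norm_Jt_beta) ** 2

print("lambda_n =", lam, "mu_n =", mu, "theta_n =", theta)
print("numerator of (10):", rhs_numerator)
```

With this choice the two terms inside the square in (10) are balanced as $2\lambda_n\sqrt{|\mathcal{A}^*|}$ and $\lambda_n\sqrt{|\mathcal{A}^*|}$, which gives $9\lambda_n^2|\mathcal{A}^*| = 288\,\sigma^2 \log(p/\eta)|\mathcal{A}^*|/n$ and $\theta_n = 6\sqrt{|\mathcal{A}^*|}$.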
3.1.1 Discussion around $\mu_n$ and the rate of convergence

In this paragraph, we highlight the cases when using $\hat\beta^{Quad}$ is useful in the sense of Theorem 1.
We mainly consider two aspects. The first one deals with situations (or conditions on the Gram
matrix $\Psi^n$) where $\phi_{\mu_n}$ is much larger than $\phi_0$, that is, the settings where the introduction
of the additional penalty enables the estimator $\hat\beta^{Quad}$ to consider problems that cannot be
treated by the Lasso. The second one is the fact that $\mu_n |\tilde J\beta^*|_2$ should be dominated by
$\lambda_n \sqrt{|\mathcal{A}^*|}$.

For the first point, and to make things more understandable, let us restrict ourselves to the
above prediction error bound (10) and consider the particular case of the Elastic-Net where
the matrix $J$ is the identity.

Because of the definition of $\phi_{\mu_n}$ (in the particular case of the Elastic-Net), we have $\phi_{\mu_n} \ge \mu_n$.
We now discuss the rates of convergence of the Lasso (with $\phi_0$) and the Elastic-Net (with
$\phi_{\mu_n}$) in different situations. We present the cases in an asymptotic setting with $n$ tending to
infinity. The results provided in Theorem 1 suggest essentially three regimes:

• when $\phi_0$ is a constant: in this case, the rate $|\mathcal{A}^*| \log(p)/n$ is optimal (up to a logarithmic
factor; cf. [6, Theorem 5.1]). This rate is reached by the Lasso (set $\mu_n = 0$ in the above
Theorem 1) and as a consequence the Elastic-Net (and more generally $\hat\beta^{Quad}$) does not
help a lot. Indeed, whatever $\mu_n > 0$, the value of $\phi_{\mu_n}$ does not significantly vary from
$\phi_0$ (although $\phi_{\mu_n} \ge \phi_0$);

• when $\phi_0$ depends on $n$ but with $\mu_n < \phi_0 < 1$: in this case, $\phi_{\mu_n}$ (and $\phi_0$ as well) is an
influencing term that should be taken into account in the rate of convergence. The rate
of the Lasso is worse than $|\mathcal{A}^*| \log(p)/n$. But, since $\mu_n < \phi_0$, the Elastic-Net does not
bring a big improvement in this case either;

• when $\phi_0$ depends on $n$ and $\mu_n \ge \phi_0$: clearly here, $\phi_{\mu_n} \ge \phi_0$. Then when $\phi_0$ is small
(or even very small), the rate of convergence of the Lasso is bad (or even the Lasso
error is not controlled when $\phi_0 \le \lambda_n^2 |\mathcal{A}^*|$), whereas the Elastic-Net is guaranteed to
reach the worst case rate $\phi_{\mu_n}^{-1} |\mathcal{A}^*| \log(p)/n$ (cf. Corollary 1 for a bound independent
of the second term in the RHS of (10)). This can lead to a big improvement. For
instance, Section 3.1.3 gives an illustrative example pointing out the advantage of using
the Smooth-Lasso estimator.

The above remarks recommend large values of $\mu_n$ due to the fact that $\phi_{\mu_n}$ grows with
$\mu_n$. However the RHS of (10) depends on $\mu_n$ also through $\mu_n |\tilde J\beta^*|_2$. Then one may choose
the largest $\mu_n$ such that the second term in the RHS of (10) remains reasonable compared
to the first one. That is, the choice of $\mu_n$ should make a tradeoff between increasing $\phi_{\mu_n}$ and
increasing $\mu_n |\tilde J\beta^*|_2$ in the bound.

To make things clearer, let us focus on the prediction error (the same reasoning is true for the
other errors). The rate of convergence is

$$\begin{cases}
\dfrac{\lambda_n^2}{\phi_{\mu_n}} |\mathcal{A}^*| & \text{if } \mu_n |\tilde J\beta^*|_2 = O\left(\lambda_n \sqrt{|\mathcal{A}^*|}\right) \text{ or even smaller in order,} \\[2ex]
\dfrac{\mu_n^2}{\phi_{\mu_n}} |\tilde J\beta^*|_2^2 & \text{otherwise.}
\end{cases}$$

Then, the term $\mu_n |\tilde J\beta^*|_2$ induces an alteration of the rate of convergence when
$\mu_n |\tilde J\beta^*|_2 \gg \lambda_n \sqrt{|\mathcal{A}^*|}$. In other words, the rate of convergence is worse when we add the
quadratic penalty unless $\mu_n |\tilde J\beta^*|_2 \lesssim \lambda_n \sqrt{|\mathcal{A}^*|}$.
All these explanations encourage the compromise stated in Corollary 1 above for the
calibration of $\mu_n$. In the next two paragraphs we provide a more detailed study in the special
cases of the Elastic-Net and the S-Lasso estimators.

3.1.2 Elastic-Net 

The Elastic-Net corresponds to the case where $J$ equals the identity matrix. Then $\mathcal{B} =
\mathcal{A}^*$ in the above theorem and corollary. The theoretical performance of the Elastic-Net has
already been considered for example in [5, 15]. In [15], the authors considered a version of the
Irrepresentable Condition to establish their consistency results. This necessary and (almost)
sufficient assumption for the variable selection task is harder to interpret than ours. The results
in the present paper (and particularly those in Section 3.3.1) about the Elastic-Net are quite
close to those in [5]. A comparison between the results obtained here and those stated in [5]
is postponed to Section 3.3.1.

When compared to the Lasso, we essentially note two differences. First, as mentioned
before Theorem 1, the Lasso brings into play a set of linear inequalities (that is, vectors
$\Delta \in \mathbb{R}^p$ such that $\sum_{j \notin \mathcal{A}^*} |\Delta_j| \le \mathrm{cst} \cdot \sum_{j \in \mathcal{A}^*} |\Delta_j|$; see for instance [33]) whereas we need
in Theorem 1 a bigger set induced by a quadratic set of inequalities (that is, $\Delta$ such that
$\sum_{j \notin \mathcal{A}^*} |\Delta_j| \le \theta_n \sqrt{\sum_{j \in \mathcal{A}^*} \Delta_j^2}$ with $\theta_n \ge 4\sqrt{|\mathcal{A}^*|}$). Even though this difference is small, let us
mention that we will establish in Section 3.3.1 theoretical guarantees which also require the
same linear set as in the Lasso case. Second, the main difference pertains to the values of $\phi_{\mu_n}$
and $\phi_0$. Since $\phi_{\mu_n} \ge \phi_0$, the Elastic-Net is useful in situations that preclude the use of the
Lasso because $\phi_0$ is close to zero. This was discussed in Section 3.1.1. For instance, when
the correlations are high between variables, the Lasso fails, whereas the Elastic-Net achieves
satisfying performance (see Corollary 1).

Finally, we observe that in the case of the Elastic-Net, Equation (11) is nothing but a SI
on the $\ell_2$ estimation error $|\beta^* - \hat\beta^{Quad}|_2^2$. Note, however, that the rate $\lambda_n \sqrt{|\mathcal{A}^*|}$, when $\mu_n$ is
defined as in Corollary 1, is not optimal (it can be sharper with more restrictive assumptions)
but has the advantage of only requiring Assumption $B(\mathcal{A}^*)$. Imposing Assumption $B(\mathcal{B})$ with
$\mathcal{B}$ larger than $\mathcal{A}^*$, a better rate of convergence can be reached (see Proposition 1). We refer to
[35, Theorem 1] for lower bounds on the $\ell_q$ estimation error of order
$|\mathcal{A}^*|^{1/q} \sqrt{\log(p/|\mathcal{A}^*|)/n}$. See also [25, 26].

3.1.3 Smooth-Lasso 

The S-Lasso corresponds to the case where $\tilde J = J'J$ with $J$ given by (5). This estimator
can deal with problems where the regression vector is expected to be $a$-smooth in the sense
of Definition 2.1. As a consequence, we have the worst case relation $|\tilde J\beta^*|_2^2 \le 7 |J\beta^*|_2^2$ (the
constant 7 comes from some rough computations and is not accurate). Note also that in this
case Assumption $B(\Theta)$ is satisfied with a set $\Theta = \mathcal{B}$ whose size is less than $3|\mathcal{A}^*|$. This set
can be expressed as

$$\mathcal{B} = \left\{ j \in \{2, \dots, p-1\} : \beta_j^* \ne 0,\ \beta_{j-1}^* \ne 0 \text{ or } \beta_{j+1}^* \ne 0 \right\},$$

and Theorem 1 holds with $c_J = 3$. Moreover, Equation (11) can be seen as a control on the
'smoothness error' $\sum_{j=2}^p (\delta_j - \delta_{j-1})^2$, where $\delta_j$ is the $j$-th component of the difference $\beta^* - \hat\beta^{Quad}$.
The S-Lasso is designed to provide a smooth and sparse solution. This is true whatever
the correlations between variables. However, it is interesting to remark that the smoothness
has quite close interactions with correlations between successive variables. Indeed, when we
deal with the S-Lasso estimator, the matrix $\tilde J$ is tridiagonal with its off-diagonal terms equal
to $-1$. If we do not consider the diagonal terms, we remark that $\Psi^n$ and $K_n$ differ only in
the terms on the second diagonals (that is, $(K_n)_{j-1,j} \ne (\Psi^n)_{j-1,j}$ for $j = 2, \dots, p$ as soon as
$\mu_n \ne 0$). Terms in the second diagonals of $\Psi^n$ correspond to correlations between successive
covariates.
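The tridiagonal structure can be checked directly. The sketch below assumes that $J$ is the first-difference matrix with a zero first row, and uses a simulated Gram matrix and an arbitrary value of $\mu_n$ (all illustrative assumptions). It forms $\tilde J = J'J$ and compares the off-diagonal entries of $\Psi^n$ and $K_n = \Psi^n + \mu_n \tilde J$.

```python
import numpy as np

p, mu = 6, 0.4                   # arbitrary illustrative values

# First-difference matrix with a zero first row (assumed form of J).
J = np.zeros((p, p))
for j in range(1, p):
    J[j, j - 1], J[j, j] = 1.0, -1.0

J_tilde = J.T @ J
print(J_tilde)

# Simulated Gram matrix and its expanded version.
rng = np.random.default_rng(4)
X = rng.normal(size=(25, p))
psi = X.T @ X / X.shape[0]
K = psi + mu * J_tilde

# Off the main diagonal, K and psi differ only on the first off-diagonals.
diff = K - psi
print(np.diag(diff, 1))          # each entry equals -mu
```

The diagonal of $\tilde J$ is $(1, 2, \dots, 2, 1)$ and its first off-diagonals equal $-1$, so adding $\mu_n \tilde J$ shifts the entries of the Gram matrix that correspond to successive covariates by $-\mu_n$ and leaves more distant pairs unchanged.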

When high correlations exist between successive covariates, a suitable choice of $\mu_n$ fulfills
Assumption $B(\mathcal{B})$. Hence, the S-Lasso estimator is particularly useful in situations where we
expect that the variables are ranked, such that not only the regression vector is 'smooth', but
also successive covariates are highly correlated. Indeed, on the one hand Assumption $B(\mathcal{B})$
is a weaker assumption for a 'smooth' regression vector. On the other hand, this 'smoothness'
makes the prediction and the estimation errors sharper (as $\phi_{\mu_n}$ depends on $|\tilde J\beta^*|_2$).

In the next paragraph, we present an illustrating example of Corollary [1] (or Theorem [1])
where we show the importance of using the Smooth-Lasso in certain situations where the
Lasso and the Elastic-Net do not provide good control on the different errors. In particular,
we present a case where correlations between variables exist (and where the Lasso is not
suitable). Moreover, since the influence of the quadratic penalty in the definition of $\hat\beta^{Quad}$
reduces when $|J\beta^*|_2$ is large (see the definition of $\mu_n$ in Corollary [1]), we consider a smooth
regression vector with large singular coefficient values such that $|J\beta^*|_2$ is small when J is the
matrix corresponding to the Smooth-Lasso, and large when J is the identity matrix associated
with the Elastic-Net. Due to this difference in the value of $|J\beta^*|_2$, the Smooth-Lasso outperforms
the Elastic-Net.



Example. Let J be the matrix defined in (5). Assume that n/4 is an integer. First of all,
let us define a smooth regression vector $\beta^*$ with n/2 non-zero components such that

$\beta_j^* = 1$ for $j = 1, \dots, n/4 - 1$, and $\beta_j^* = 1 - \frac{4}{n}\left(j - \frac{n}{4}\right)$ for $j = n/4, \dots, n/2$.

This regression vector is chosen piecewise linear (a particular case of smoothness) to clarify
the idea and for simplicity of computations. The vector $\beta^*$ is such that



$|\beta^*|_2 = \sqrt{\frac{n}{3} - \frac{1}{2} + \frac{2}{3n}} = O(\sqrt{n}), \qquad |J\beta^*|_2 = O(1/\sqrt{n}).$



Then, we can set the smoothness parameter $a = 4/\sqrt{n}$ in Definition [2.1].

Let us now consider the design matrix X. Let $e > 0$ be a real number. Let $\Psi^n$ be a
tridiagonal Gram matrix with diagonal elements equal to 1 (that is, normalized) and such
that $\Psi^n_{j-1,j} = \Psi^n_{k+1,k} = e$ for j = 2, ..., p and k = 1, ..., p - 1. In such a case, the spectrum of
the Gram matrix lies in $[1 - 2e, 1 + 2e]$. Then, $\phi_0 \ge 1 - 2e$ (the $\phi_{\mu_n}$ corresponding to the Lasso
estimate, that is, when $\mu_n = 0$). However, we do not know how far $\phi_0$ is from $1 - 2e$, so that
we can only say that the prediction error of the Lasso $\hat\beta^{L}$ is such that with high probability



X/3* - Xf3 



2 ^ 16V2.nog(p/,) 
n 1 — 2e n 



with the choice e = ^ — ~^2n^' Actually, the above bound does not provide any control on 
the prediction error of the Lasso estimator. 
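The claim that the spectrum of the tridiagonal Gram matrix lies in [1 - 2e, 1 + 2e] can be illustrated numerically. This is a minimal sketch under assumed values of p and e (chosen here only for illustration); the helper name `tridiagonal_gram` is hypothetical. It also compares the computed eigenvalues with the classical closed form for symmetric tridiagonal Toeplitz matrices, $1 + 2e\cos(k\pi/(p+1))$.

```python
import numpy as np

def tridiagonal_gram(p, e):
    """Gram matrix with unit diagonal and constant value e on the two second diagonals."""
    return np.eye(p) + e * (np.eye(p, k=1) + np.eye(p, k=-1))

p, e = 50, 0.45  # illustrative values; e close to 1/2 makes 1 - 2e small
Psi = tridiagonal_gram(p, e)

eigs = np.linalg.eigvalsh(Psi)  # eigenvalues in ascending order
closed_form = 1 + 2 * e * np.cos(np.arange(1, p + 1) * np.pi / (p + 1))

print(eigs.min(), 1 - 2 * e)   # smallest eigenvalue versus the lower end of the interval
print(eigs.max(), 1 + 2 * e)   # largest eigenvalue versus the upper end of the interval
```

As e approaches 1/2 the smallest eigenvalue approaches zero, which is the situation in which the Lasso bound above becomes uninformative.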
Let us now focus on the Elastic-Net estimate $\hat\beta^{EN}$. According to Assumption B(A*), we have
to consider the spectrum of the matrix $K_n^{EN} = \Psi^n + \mu_n I_p$, where $I_p$ is the identity matrix in
$\mathbb{R}^p$. This spectrum lies in $[1 - 2e + \mu_n, 1 + 2e + \mu_n]$. Given the values of e and of $|\beta^*|_2$, we get
the control



EN 



' < , \ {2XnyW\ +2f^nW*\2f = 0{ammW\og{p)\A*\,\A*\}), 

n i — Ze + fin 



where we used the definition of $\mu_n$ provided in Corollary [1]. Let us mention that choosing a
different value for $\mu_n$ does not imply an improvement in the bound. Hence, in this case the
Elastic-Net estimator does not control the prediction error either.

Next, in the case of the S-Lasso the eigenvalues of the matrix $K_n^{SL} = \Psi^n + \mu_n\tilde J$ lie in
$[1 + \mu_n - 2|e - \mu_n|, 1 + 2\mu_n + 2|e - \mu_n|]$. We refer to [38] for more details on the eigenvalues of
tridiagonal matrices. This interval is of the same order as the one of the Elastic-Net.
Consequently, we have the following control for the S-Lasso estimator (when $e > \mu_n$, otherwise the
control is even better)



XP* - X(3 



SL 



< . i2XnVW\ + 2fin\m.? = o L-VM^ 



1 — 2e + \ n 



where, here again, we considered the value of $\mu_n$ given in Corollary [1]. In this 'smooth context',
the S-Lasso is clearly the best method (compared with the Lasso and the Elastic-Net).
Note that this last rate is better than the minimax rate under the sparsity assumption,
$\log(p/|A^*| + 1)|A^*|/n$ (see [3, Theorem 5.1]). This is due to the fact that we also imposed a smoothness
assumption which is nicely exploited by the S-Lasso estimator. Thus, the above minimax
rates cannot be applied anymore.

Let us conclude with the following remarks: in the above situation, we assume that the
regression vector is smooth and also that the successive covariates are correlated. This is the best
context for the Smooth-Lasso.

In the case where the regression vector is smooth, but we do not have a particular structure in
the Gram matrix (say the variables are independent and $\phi_0$ is a fixed positive constant), the
Lasso and the Elastic-Net (for instance with the value of $\mu_n$ given in Corollary [1]) reach the rate
$\sigma^2\log(p)|A^*|/n$. Compared to the bounds for the Elastic-Net, there are improved bounds for the
S-Lasso and for suitable values of $\mu_n$ (note that $\mu_n$ depends on a). Here again, if we consider

(\/\os,(p)\A* I 
a— — 

Consequently, we get better performance than the Elastic-Net and the Lasso. 
Finally, when the regression vector is not smooth (say, $|\beta^*|_2$ and $|J\beta^*|_2$ are constants) and
the design matrix is for instance as in the above example, the Lasso is not suitable. In this
case, both the Elastic-Net and the S-Lasso have comparable performance and their bound is
of order $O(\sqrt{\log(p)|A^*|/n})$, which is much better than the bounds for the Lasso (even if not
optimal).

The above discussion dealt with the prediction and the estimation performance. In the
next section we consider the variable selection power of $\hat\beta^{Quad}$.

3.2 Variable selection 

Let us first mention that the estimator $\hat\beta^{Quad}$, with the Smooth-Lasso as a particular case, has
not been introduced for such an objective. Indeed, it is designed to deal with the estimation
criterion or, more precisely, with structural questions. However, in some problems $\hat\beta^{Quad}$ may
induce better variable selection properties than the Lasso.

A large amount of work has been done on the topic of variable selection for Lasso-type
methods. One important observation is that one has to make a compromise between not
identifying a low signal level (that is, small coefficients $\beta_j^*$, $j \in A^*$, in absolute value) and
imposing a strong restriction on the Gram matrix, which sometimes seems to be
unrealistic. Moreover, the question of the identifiability of $\beta^*$ also has to be considered. Since
we tackle problems where we expect correlations between variables, we take the middle path,
that is, we impose less restrictive assumptions on the Gram matrix that permit us to recover
a reasonably low signal level. For this purpose, we first provide a bound on the sup-norm
$|\beta^* - \hat\beta^{Quad}|_\infty$, based on a control on the $\ell_2$ estimation error.

To this end, we use Assumption B($\Theta$) on the Gram matrix. However the set $\Theta$ should be
larger than the one required in Theorem [1]. To define it, let us denote by C the index-set of the
m largest components in absolute value of $\beta^* - \hat\beta^{Quad}$ outside B. Here B is the set introduced
in Theorem [1]. In this setting m is an integer such that $m + |B| \le p$.

Assumption B'{B U C): Let Kn be the matrix given by (HI) and let Qn = 4^]^+ ^| J/3*|2. 
There is a constant (/>^^ > such that, for any A G that satisfies X^j^g|Aj| < 

AXA>0^„ A|. (12) 
jeBuc 

The above assumption differs from Assumption B{Q) only in that we restrict W in a dif- 
ferent set than the one used in Condition ([12]). Obviously, Assumption B'(BUC) implies 
Assumption B{B). 

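Proposition 1. Let us consider the same setting as in Theorem [1] with the only difference
that $\lambda_n = 2\sqrt{2}\,\sigma\sqrt{\log(p/\eta)/n}$ with $0 < \eta < 1$. Under Assumption B'(B ∪ C) and with probability
$1 - \eta$,

$|\hat\beta^{Quad} - \beta^*|_\infty \le |\hat\beta^{Quad} - \beta^*|_2 \le c\,(\lambda_n\sqrt{|B|} + \mu_n|J\beta^*|_2)$,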
where c = 26-^(1 + -^). 

One can exploit the control provided in Proposition [1] to construct a hard-thresholded
version of $\hat\beta^{Quad}$ which is consistent in variable selection. Such a construction has already
been considered in several papers for the Lasso estimate. The methodology closest to ours is
the one developed in [23].

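Consider $\hat\beta^{\text{Th-Quad}} = (\hat\beta_1^{\text{Th-Quad}}, \dots, \hat\beta_p^{\text{Th-Quad}})'$, the thresholded $\hat\beta^{Quad}$ estimator defined
by

$\hat\beta_j^{\text{Th-Quad}} = \hat\beta_j^{Quad}$ if $|\hat\beta_j^{Quad}| > c\,(\lambda_n\sqrt{|B|} + \mu_n|J\beta^*|_2)$,

and zero otherwise, where c is given in Proposition [1]. This estimator consists of $\hat\beta^{Quad}$ with
its small coefficients reduced to zero. We thus enforce the selection property of $\hat\beta^{Quad}$. Variable
selection consistency of this estimator is established under one more restriction on the
regression vector given now.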

Assumption C: The true regression vector $\beta^*$ is such that

$\min_{j \in A^*}|\beta_j^*| > 2c\,(\lambda_n\sqrt{|B|} + \mu_n|J\beta^*|_2)$,
where c is the constant from Proposition [1], which depends on the term $\phi_{\mu_n}$ appearing in
Assumption B'(B ∪ C).

Here again, we observe how important the quantity $\phi_{\mu_n}$ is. We want it to be as large as
possible.

This assumption bounds from below the smallest non-zero regression coefficient in $\beta^*$. This is a
common assumption to provide sign consistency in the high dimensional case. See for example 
[H [181 EH [34l [39I I40| . We refer to |18| for a longer discussion on how these works are related 
in terms of restrictions related to the threshold or the assumption on the Gram matrix. Now, 
we can state the following sign consistency result. 

Theorem 2. Let us consider the thresholded estimator $\hat\beta^{\text{Th-Quad}}$ as described above. In the
same setting as in Proposition [1] and under Assumption B'(B ∪ C) and Assumption C,

$P\big(\mathrm{sign}(\hat\beta^{\text{Th-Quad}}) = \mathrm{sign}(\beta^*)\big) \ge 1 - \eta$.

Note that all the remarks established in Sections [3.1.2] and [3.1.3] remain valid also for this
variable selection result.
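The thresholding argument behind this sign consistency result can be sketched in a few lines. The example below is illustrative only: the threshold `tau` stands in for the theoretical level built from the constant c of Proposition [1] (not computed here), and the vectors and the function names `hard_threshold` and `same_sign_pattern` are hypothetical. It shows that when the sup-norm error is at most `tau` and the smallest non-zero coefficient exceeds `2 * tau`, hard thresholding at `tau` recovers the sign pattern.

```python
import numpy as np

def hard_threshold(beta_hat, tau):
    """Keep the coefficients whose absolute value exceeds tau, set the others to zero."""
    beta_hat = np.asarray(beta_hat, dtype=float)
    return np.where(np.abs(beta_hat) > tau, beta_hat, 0.0)

def same_sign_pattern(a, b):
    """True when the two vectors have identical componentwise signs (including zeros)."""
    return bool(np.array_equal(np.sign(a), np.sign(b)))

# Hypothetical truth and a hypothetical estimate with sup-norm error 0.3.
beta_star = np.array([3.0, 1.5, 0.0, 0.0, 2.0, 0.0])
beta_hat = beta_star + np.array([0.2, -0.3, 0.25, -0.1, 0.1, 0.3])

tau = 0.4  # assumed bound on the sup-norm error; min non-zero |beta*_j| = 1.5 > 2 * tau
beta_th = hard_threshold(beta_hat, tau)

print(beta_th)
print(same_sign_pattern(beta_hat, beta_star), same_sign_pattern(beta_th, beta_star))
```

The raw estimate keeps spurious non-zero components, while the thresholded version matches the sign pattern of the hypothetical true vector.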



3.3 Technical advances 

We devote this paragraph to several technical considerations. First, we consider the case of a
general matrix J. Then, we establish the variable selection consistency of a non-thresholded
version of $\hat\beta^{Quad}$. Finally, we provide a relaxation of the assumption on the noise. The reader
who is not interested in these studies can skip them without consequences for the readability
of the paper.

3.3.1 General matrices J 

Theorem [1] is particularly interesting when $\tilde J = J'J$ is sparse. In that statement, Assumption
B(B) was needed with a set $B \supseteq A^*$ which depends on J. More precisely, B contains
the indices of components which interfere in the sparse product $\beta^{*\prime}\tilde J u$ for a given $u \in \mathbb{R}^p$ (see
the proof for more details). This set is not too large compared to A* when we consider the
case where J is sparse. This way of solving the problem allows us to choose $\mu_n$ as in
Corollary [1]. In what follows, we consider p x p matrices J (including the sparse case)
for which we only need an (adapted) RE Assumption. Contrary to the results provided in
Section [3.1], $\mu_n$ is here, for technical reasons, not a free parameter anymore and is fixed in
advance (see (13) below). This value is smaller than the one given in Corollary [1].

Let us first establish the assumptions needed and the setup of this contribution. Let
$\eta \in (0, 1)$. We define the regularization parameters $\lambda_n$ and $\mu_n$ in the following way:



= 8V2a\r-^^^^ and ^„ = A„^i . (13) 

We now state the adapted RE Assumption which differs from the usual one introduced in [3]
only by the matrix to which we apply the assumption ($K_n$ instead of $\Psi^n$):
Assumption RE: There is a constant $\phi_{\mu_n} > 0$ such that, for any $\Delta \in \mathbb{R}^p$ that satisfies
T.jiA* l^il ^ ^T.jeA* l^ji have 

A'K^A > cp^^ Yl 

This assumption involves a set of linear inequalities. Then, we clearly have $\phi_{\mu_n} \ge \phi_0$ (the $\phi_{\mu_n}$
corresponding to the Lasso, that is, when $\mu_n = 0$). With this setting, we obtain the following
result for a general matrix J.

Theorem 3 (General J). Let A* be the sparsity set and let the tuning parameters $(\lambda_n, \mu_n)$ be
defined as in (13). If Assumption RE holds, then with probability greater than $1 - \eta$ we have



Xl3* - XP^"""^ 



and 

113* -^Q--%<8cp-lXn\A% 

Similar bounds were provided for the Lasso estimator in [3]. Let us mention that the
constants are not optimal. We focused our attention on the dependency on n (and thus on
p and $|A^*|$). It turns out that our results are near optimal. For instance, for the $\ell_2$ risk, the
S-Lasso estimator reaches nearly the optimal rate $\log(p/|A^*| + 1)|A^*|/n$ up to a logarithmic factor
(see [3, Theorem 5.1]). Moreover, Theorem [3] states a control on an error which is linked to
the expected prior information which suggested the use of the estimator $\hat\beta^{Quad}$.

The results provided in Theorem [1] and more precisely Corollary [1] differ from those established
in Theorem [3] in a few points. First, the value of $\mu_n$ is larger in the sparse case. Indeed,



Un equals j and A„ .~l . — in Corollary [T] and Theorem [3] respectively. The former 

value can be much larger for some regression vector $\beta^*$. Second, these values of $\mu_n$ have an
influence on the error bounds through $\phi_{\mu_n}$. As a consequence, the bounds in Corollary [1] are
better than those in Theorem [3]. Third, apart from the considerations on the quantity $\phi_{\mu_n}$,
we observe a modification of the bound of $(\beta^* - \hat\beta^{Quad})'\tilde J(\beta^* - \hat\beta^{Quad})$. Indeed, in Theorem [1] it
involves the term $|J\beta^*|_2\sqrt{|A^*|}$, whereas in Theorem [3] $|J\beta^*|_\infty|A^*|$ appears, which is obviously
larger. We then have a better control on this error using the sparsity of the matrix J. Finally,
we remark that the constant factor in the definition of the tuning parameter $\lambda_n$ in Corollary [1]
is smaller than the corresponding constant in Theorem [3]. One should however mention that
for a fixed $\phi_{\mu_n}$ (that is, a fixed $\mu_n$) the set of feasible vectors $\Delta$ in Assumption RE is larger
than the one in Assumption B(B). In this sense, Assumption RE is less restrictive than Assumption
B(B). Nevertheless, this difference does not clearly mean that the $\phi_{\mu_n}$ resulting
from Assumption RE is larger than the one arising from Assumption B(B). Indeed, when
$\Delta$ is in the feasible set of both assumptions, $\phi_{\mu_n}$ is the same in both conditions.

A result close to Theorem [3] has been established by Bunea in [5] in the particular case
of the Elastic-Net. It is worth briefly pointing out here the differences and the similarities
between our work and [5] when we deal with the Elastic-Net. For any vector $b \in \mathbb{R}^p$ and subset
$\Theta \subset \{1, \dots, p\}$, let $b_\Theta$ be the vector in $\mathbb{R}^p$ such that $(b_\Theta)_j = b_j$ if $j \in \Theta$ and zero otherwise.
In [5], Bunea provided a Sparsity Inequality (SI) close to the one established in Theorem [3]. This inequality holds
under the Condition Stabil defined in [5, page 4] by

where (j/jl^ > and similarly to vectors in Assumption RE, A is such that X^j^^* l^jl — 
^TlijeA* '^^^ above equation is the analogous of the condition (|9]) in Assumption RE, 

and to make the comparison easier, let us write ([9]) as follows 

$\Delta'\Psi^n\Delta \ge (\phi_{\mu_n} - \mu_n)|\Delta_{A^*}|_2^2 - \mu_n|\Delta_{(A^*)^c}|_2^2$. (14)

Since the bounds in the Sparsity Inequalities stated in [5] and in the present paper are the same up to
constants, it seems that the only difference is the value of $\phi_{\mu_n}$. Indeed, according to
Inequality (14), $\phi_{\mu_n}$ can be much larger than the corresponding constant in Condition Stabil, as we subtract
the term $\mu_n|\Delta_{(A^*)^c}|_2^2$ in (14), which can be large thanks to $\mu_n$ (we expect however $|\Delta_{(A^*)^c}|_2$ to
be small). It is worth adding that the Elastic-Net corresponds to a case where the matrix J is
sparse (as J is the identity). Therefore, it is more convenient to use the setting of Section [3.1]
since the value of $\mu_n$ is larger there.

3.3.2 Non-thresholded variable selector 

In Section [3.2] we established variable selection consistency for a thresholded version of $\hat\beta^{Quad}$
when J is sparse. In this section, we state a comparable result for a non-thresholded version.
Indeed, paying the price of a more restrictive assumption, we provide in Theorem [4] below a
variable selection consistency result directly for $\hat\beta^{Quad}$ when using a different calibration of the
tuning parameters. This result can be applied to general matrices J. The approach to prove
the result is also different. We first provide a bound on the sup-norm $|\beta^*_{A^*} - \hat\beta^{Quad}_{A^*}|_\infty$. This can
be done easily using the theorem stated in Section [3.3.1] for the $\ell_1$ estimation error $|\beta^* - \hat\beta^{Quad}|_1$.
However, this would imply that only 'high' levels of the signal can be recovered, that is,
coefficients $\beta_j^*$, $j \in A^*$, such that $|\beta_j^*| > \mathrm{cst} \cdot \lambda_n|A^*|$. Therefore, we prefer to exploit here again
a control on the $\ell_2$ estimation error $|\beta^*_{A^*} - \hat\beta^{Quad}_{A^*}|_2$ instead, which in turn enables us to
recover signals with $|\beta_j^*| > \mathrm{cst} \cdot \lambda_n\sqrt{|A^*|}$ with the same assumption on the matrix $K_n$. Let
us mention that $\lambda_n\sqrt{|A^*|}$ is not the best level which can be recovered. One can also get rid
of the extra term $\sqrt{|A^*|}$ through a quite restrictive assumption on the correlations between
variables (see Remark [3]).

Proposition [2] below is a first step to a variable selection result. It states that $\hat\beta^{Quad}$ enables
us at least to detect the relevant variables (and maybe also some noise variables):

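Proposition 2. Let us consider the same setting as in Theorem [3] with the only difference that
$\lambda_n = 4\sqrt{2}\,\sigma\sqrt{\log(p/\eta)/n}$ and $\mu_n = \lambda_n/(4|J\beta^*|_\infty)$ with $0 < \eta < 1$. Under Assumption RE,
and with probability larger than $1 - \eta$, we have

$|\beta^*_{A^*} - \hat\beta^{Quad}_{A^*}|_\infty \le |\beta^*_{A^*} - \hat\beta^{Quad}_{A^*}|_2 \le 2\phi_{\mu_n}^{-1}\lambda_n\sqrt{|A^*|}$,

where $\phi_{\mu_n}$ is the constant appearing in Assumption RE. Moreover, if $\min_{j \in A^*}|\beta_j^*| > 2\phi_{\mu_n}^{-1}\lambda_n\sqrt{|A^*|}$,
we have

$P\big(\mathrm{sign}(\hat\beta^{Quad}_{A^*}) = \mathrm{sign}(\beta^*_{A^*})\big) \ge 1 - \eta$.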
Proposition [2] is a trivial consequence of Theorem [3]. A short proof is given in the Appendix
section. This proposition emphasizes directly that under Assumption RE all non-zero
components of $\beta^*$ are detected by $\hat\beta^{Quad}$ with high probability. Actually, in the setting of
Proposition [2], $\hat\beta^{Quad}$ may contain too many non-zero components. More restrictions are
needed in order to ensure the variable selection consistency of $\hat\beta^{Quad}$. Here is an additional
assumption on the Gram matrix which controls the correlations between the truly relevant
variables and those which are not.

Assumption D: We assume that 

t 



max max |(i^„)i 1.1 < , , ,, 

where t is a positive term smaller than . 

This assumption is quite close to the Mutual Coherence assumption which involves the Gram
matrix instead of $K_n$. In addition, the Mutual Coherence assumption restricts correlations
between all covariates.

Theorem 4. Let consider the linear regression model Let A„ = \J'^^^^^^^^^ 

and Hn = An/(4| J/3*|oo)- Under Assumptions RE and C, and also Assumption D, we have 

pU ^ < r], 

and 

'sign(/3'3«'^'^) = sign(/3*)) > 1 - r?. 

To prove the first claim, we use some arguments from [5]. The second point is a consequence
of the first one and of Proposition [2]. There are essentially two differences between the
settings in Theorem [3] and Proposition [2]. First, we need for this last result a more restrictive
assumption on the correlations between variables. However, this restriction is only between
relevant variables and irrelevant covariates. This is 'quite' a reasonable assumption to identify
the relevant variables, that is, the non-zero components of the vector $\beta^*$. Second, the minimal
value of $\lambda_n$ is larger in this last theorem. This suggests that we need a larger value of
this tuning parameter to set to zero the irrelevant components. Note that we established the
variable selection consistency of $\hat\beta^{Quad}$ but with a value of the tuning parameter $\mu_n$ smaller
than the one used in the thresholded version.

Remark 3. The results of Theorem [4] can also be obtained under the more restrictive Mutual
Coherence assumption: maxjg_4* max^gji \{Kn)j,k\ < TT^> where i is a small positive con- 

stant. Here, even the correlations between relevant variables are restricted but this restriction 
makes it possible to recover even smaller signals. That is, we can detect coefficients of $\beta^*$ such
that $|\beta_j^*| > \mathrm{cst} \cdot \sqrt{\log(p)/n}$. See for instance [5] in the case of the Elastic-Net.

3.3.3 Non Gaussian noise with finite variance

Most of the results established for Lasso-type methods assume Gaussian or sub-Gaussian type 
noise [3l [5l [151 EH |39]. Noise with exponential moment is studied in [31 |23]. Only a few 
references consider other types of noise. Noise with moment of order 2k, where $k \ge 1$ is an
integer, is considered in [40], whereas in the paper [18], the author presents the case where the
noise admits zero mean and finite variance. It is in the same spirit as in this last reference
that we consider this relaxation on the noise. Concerning the Elastic-Net, noise with moment
of order $2k + \delta$, where $k \ge 1$ is an integer and $\delta$ is a positive constant, is considered in [43],
but the authors treated only the case where p = O(n).

We assume that the noise random variables $\varepsilon_1, \dots, \varepsilon_n$ are independent and admit zero
mean and finite variance. That is, $E\varepsilon_i = 0$ and $E\varepsilon_i^2 \le \sigma^2$ for i = 1, ..., n with $\sigma^2 < \infty$. In this
generalization we also use a revisited version of Nemirovski's Inequality established in [11].
One more restriction is needed on the sample points.

Assumption E: There exists a positive constant L < oo such that 

n 

n~^\^ max • < L. 

1=1 

Theorem [5] below extends the results in Corollary [1] of Section [3] to the non-Gaussian noise 
case. However, one is able to generalize all the results of that section in the same way. 

Theorem 5. Let us consider the linear regression model (1) where the $\varepsilon_i$'s are independent
random variables with zero mean such that $E\varepsilon_i^2 \le \sigma^2$ for i = 1, ..., n with $\sigma^2 < \infty$. Denote by



KNem the quantity KNem = infg6[2,oo)(9-l)p^''^, and let A„ = 4c7y' ^'^^^^ with < r? < 1. Let 

= '''"V^-^ ^ . Assume also that Assumption B{B) (where Qn = Q\/\A*\) and Assumption E 
hold. Then, with probability greater than 1 — r] we have 



Let us mention that $2e\log(p) - 3e \le K_{Nem} \le 2e\log(p) - e$. As a consequence, the rate
of convergence in Theorem [5] is of the same order as in Corollary [1]. However, the constant
factor seems to be worse in the non-Gaussian case since it brings into play the constant L
which can be large. This is the price to pay to adapt to the non-Gaussian noise.
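The two-sided bound on $K_{Nem}$ can be checked numerically. The sketch below assumes that the constant reads $K_{Nem} = \inf_{q \in [2, \infty)} (q - 1)p^{2/q}$, with e denoting Euler's number in the bound; the infimum is approximated on a finite grid, and the function name `k_nem` and the grid parameters are hypothetical choices made for illustration.

```python
import numpy as np

def k_nem(p, q_max=200.0, num=200001):
    """Approximate inf over q in [2, q_max] of (q - 1) * p**(2/q) on a uniform grid."""
    q = np.linspace(2.0, q_max, num)
    return float(np.min((q - 1.0) * p ** (2.0 / q)))

for p in (8, 100, 500, 5000):
    lower = 2 * np.e * np.log(p) - 3 * np.e
    upper = 2 * np.e * np.log(p) - np.e
    print(p, lower, k_nem(p), upper)
```

The upper bound corresponds to the particular choice q = 2 log(p), for which $p^{2/q} = e$; the grid minimum sits slightly below it for the values of p tried here.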

Remark 4. In the above theorem, $\eta$ is fixed. However, one can set $\eta$ depending on p (or on
n) in such a way that it decreases to zero as $p \to \infty$ (or $n \to \infty$). It is interesting to note that
in this case, we lose a small power of $\log(p)$ (or $\log(n)$) in the rate of convergence when we
consider non-Gaussian noise compared to the Gaussian case.

Using similar reasoning as in Theorem [5] (cf. proof of Theorem [5]), there is no major
difficulty in extending the variable selection results established in Section [3.2] with Gaussian
noise to the case where the noise is defined as above. This can be done using Lemma [3] instead
of Lemma [2] of Section [6] in all the proofs.

4 Experimental Results 

In this section, we present the experimental performance of the estimator $\hat\beta^{Quad}$. In particular,
we focus on two special cases: the Elastic-Net and the S-Lasso defined respectively with
the penalties $\mathrm{pen}^{EN}(\beta) = \lambda|\beta|_1 + \mu|\beta|_2^2$ and $\mathrm{pen}^{SL}(\beta) = \lambda|\beta|_1 + \mu\sum_{j=2}^{p}(\beta_j - \beta_{j-1})^2$. The
Elastic-Net is useful when high correlations between variables appear, whereas the S-Lasso
is devoted to problems where the regression vector $\beta^*$ is 'smooth' (small variations in the
values of the successive components of $\beta^*$). We are essentially interested in the performance
of these estimators w.r.t. their estimation accuracy, that is, in terms of the estimation error
$|\hat\beta - \beta^*|_2$, when $\beta^*$ is known (simulated data). Indeed, the introduction of $\hat\beta^{Quad}$ is motivated
by a priori knowledge on the structure of the parameter $\beta^*$, or on the correlation between
variables, and the purpose here is to see how this information can be taken into account
to improve the reconstruction of the vector $\beta^*$. As benchmarks, we use the Lasso and the
Fused-Lasso estimators, since the first is the reference method and the second is close in spirit
to the S-Lasso estimator. Indeed, the Fused-Lasso produces solutions with equal successive
components ('piecewise constant') [30]. Note also that in the pioneering paper on the Elastic-Net, a
'corrected' version of this estimator is proposed [42]. There is as yet no theoretical support for
this method. Moreover, it outperforms the 'non-corrected' Elastic-Net (this 'non-corrected'
Elastic-Net is called naive in [42]) in only a very few of the situations we consider in this
paper. We omitted the results for these 'corrected' versions to avoid digressions.
Except for the Fused-Lasso solution, all of the Lasso, the S-Lasso and the Elastic-Net solutions
can be computed with the LARS algorithm (cf. Lemma [1]). However, we will not use
the LARS algorithm in this study. In order to be fair to all the methods, we used the same
algorithm for all of them. We use an algorithm provided by J. Mairal¹, which is an implementation
of a general algorithm given in [24].
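One standard way to see why a Lasso solver can handle the quadratic penalty is the augmented-data identity sketched below. This is an illustration under stated assumptions, not necessarily the construction of Lemma [1]: it assumes the data-fit term is $\|y - X\beta\|_2^2/n$ and that the quadratic penalty is $\mu|J\beta|_2^2$; the helper name `augment` and the simulated inputs are hypothetical.

```python
import numpy as np

def augment(X, y, J, mu):
    """Stack sqrt(n * mu) * J under X and zeros under y (augmented-data trick)."""
    n = X.shape[0]
    X_aug = np.vstack([X, np.sqrt(n * mu) * J])
    y_aug = np.concatenate([y, np.zeros(J.shape[0])])
    return X_aug, y_aug

rng = np.random.default_rng(0)
n, p, mu = 20, 8, 0.3
X = rng.standard_normal((n, p))
y = rng.standard_normal(n)
J = np.diff(np.eye(p), axis=0)  # (p-1) x p first differences (assumed S-Lasso choice)
b = rng.standard_normal(p)      # arbitrary coefficient vector used to test the identity

X_aug, y_aug = augment(X, y, J, mu)

# Quadratically penalized least squares criterion on the original data ...
lhs = np.sum((y - X @ b) ** 2) / n + mu * np.sum((J @ b) ** 2)
# ... equals the plain least squares criterion on the augmented data.
rhs = np.sum((y_aug - X_aug @ b) ** 2) / n

print(lhs, rhs)
```

With this rewriting, adding the $\ell_1$ penalty on both sides turns the quadratically penalized problem into an ordinary Lasso problem on `(X_aug, y_aug)`, so any Lasso solver can in principle be reused; taking `J = np.eye(p)` gives the Elastic-Net case.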

In all our experiments, the tuning parameters are chosen based on the 10-fold cross validation
criterion (for the Fused-Lasso, the Elastic-Net and the S-Lasso, the cross validation is
performed on a 2D grid), but we also display the results obtained based on the theoretical
values. Note that for the Fused-Lasso, we consider the same theoretical values of the tuning
parameters as for the S-Lasso as they are both motivated by similar applications (this choice
seems arbitrary, but to our knowledge no precise study has been made for the Fused-Lasso in
the context we consider). On the other hand, both the Elastic-Net and the S-Lasso involve
a sparse matrix J in the definition of the estimator $\hat\beta^{Quad}$. Then, the theoretical values of
the tuning parameters are $\lambda = 2\sqrt{2}\,\sigma\sqrt{\log(p)/n}$ and $\mu = \lambda\sqrt{|A^*|}/(2|J\beta^*|_2)$, in accordance with
Corollary [1] and Proposition [1]. These quantities depend on unknown parameters. They can
be used only in the simulation study, otherwise one needs to estimate $|J\beta^*|_2$.
The different methods are applied to several simulation examples. They have also been applied
to a pseudo-real dataset generated from the riboflavin dataset.
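The theoretical calibration can be computed directly when the true vector is known. The following is a hedged sketch: it takes the formulas $\lambda = 2\sqrt{2}\sigma\sqrt{\log(p)/n}$ and $\mu = \lambda\sqrt{|A^*|}/(2|J\beta^*|_2)$ at face value, assumes J is the first-difference matrix for the S-Lasso and the identity for the Elastic-Net, and uses a hypothetical smooth vector; the function name `theoretical_tuning` is not from the paper.

```python
import numpy as np

def theoretical_tuning(beta_star, J, sigma, n):
    """Return (lambda, mu) from the theoretical formulas, given the true vector."""
    p = beta_star.size
    sparsity = np.count_nonzero(beta_star)            # |A*|
    lam = 2 * np.sqrt(2) * sigma * np.sqrt(np.log(p) / n)
    mu = lam * np.sqrt(sparsity) / (2 * np.linalg.norm(J @ beta_star))
    return lam, mu

p, n, sigma = 100, 30, 3.0
beta_star = np.zeros(p)
beta_star[:14] = (3.0 - 0.2 * np.arange(1, 15)) ** 2  # smooth, 14 non-zero entries

J_sl = np.diff(np.eye(p), axis=0)  # assumed S-Lasso choice: first differences
J_en = np.eye(p)                   # Elastic-Net choice: identity

lam, mu_sl = theoretical_tuning(beta_star, J_sl, sigma, n)
_, mu_en = theoretical_tuning(beta_star, J_en, sigma, n)
print(lam, mu_sl, mu_en)
```

For a smooth vector, $|J\beta^*|_2$ is smaller with first differences than with the identity, so the quadratic penalty receives more weight for the S-Lasso than for the Elastic-Net in this illustration.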

4.1 Synthetic data 

There are several parameters: the dimension p, the sample size n and the level of noise $\sigma$.
They will be specified in the experimental settings (that is, in the different tables and figure
captions). The first example is classical and has been introduced in the original paper of the Lasso
[29]. The second simulation, where we are interested in observing the performance of the
procedures when groups of variables appear, comes from [42]. The last two studies aim at
determining the behavior of the methods when the regression vector is 'smooth'.

Example (a) [σ/ρ]: No particularities. We fix p = 8 and n = 20. Here only $\beta_1^*$, $\beta_2^*$ and $\beta_5^*$ are
nonzero and equal respectively 3, 1.5 and 2. Moreover, for $j, k \in \{1, \dots, 8\}$ the design
correlation matrix $\Psi$ is defined by $\Psi_{j,k} = \rho^{|j-k|}$ where $\rho \in\, ]0, 1[$.

¹http://www.di.ens.fr/~mairal/index.php
Figure 1: Performance of the Lasso (L), the S-Lasso (SL), the Fused-Lasso (FL), and the Elastic-Net
(EN) applied to Example (a) and based on 500 replications. The tuning parameters are chosen
based on the theoretical study. Left: Evaluation of the prediction error $\|y_{test} - X_{test}\hat\beta\|_n^2$ in comparison
with the performance of the truth (T), that is, $\|y_{test} - X_{test}\beta^*\|_n^2$. Right: Evaluation of the
$\ell_2$ estimation error $|\hat\beta - \beta^*|_2$.



Example (b) [p/n/σ]: Groups. We have $\beta_j^* = 3$ for $j \in \{1, \dots, 15\}$ and zero otherwise. We
construct three groups of correlated variables: $\Psi_{j,j} = 1$ for every $j \in \{1, \dots, p\}$; for
$j \neq k$, $\Psi_{j,k} \approx 1$ (actually $\Psi_{j,k} = \frac{1}{1+0.01}$ due to an extra noise variable) when (j, k)
belongs to $\{1, \dots, 5\}^2$, $\{6, \dots, 10\}^2$ or $\{11, \dots, 15\}^2$, and zero otherwise.

Example (c) [p/n/σ]: Smooth regression vector. The regression vector is given by $\beta_j^* = (3 - 0.2j)^2$
for j = 1, ..., 15 and zero otherwise. Moreover, the correlations are described
by $\Psi_{j,k} = \exp(-|j - k|)$ for $(j, k) \in \{1, \dots, p\}^2$.

Example (d) [p/n/σ]: High sparsity index and smooth regression vector. The regression vector
is such that $\beta_j^* = (4 + 0.1j)^2$ for $j \in \{1, \dots, 40\}$ and zero otherwise, and the correlations
are the same as in Example (c).

Except when p = 500 where we run only 100 replications, we based all the experiments on 
500 replications. 
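As an illustration of how such a synthetic data set can be generated, here is a minimal sketch for Example (c). It assumes a centered Gaussian design with covariance $\exp(-|j - k|)$ and Gaussian noise with standard deviation σ; the distributional choices, the function name `simulate_example_c` and the seed are assumptions made for this sketch, not details confirmed by the text.

```python
import numpy as np

def simulate_example_c(p, n, sigma, rng):
    """Draw (X, y, beta_star) following the description of Example (c) [p/n/sigma]."""
    j = np.arange(1, p + 1)
    # Smooth regression vector: (3 - 0.2 j)^2 for j = 1..15, zero otherwise.
    beta_star = np.where(j <= 15, (3.0 - 0.2 * j) ** 2, 0.0)
    # Correlation between covariates j and k decays as exp(-|j - k|).
    corr = np.exp(-np.abs(j[:, None] - j[None, :]))
    X = rng.multivariate_normal(np.zeros(p), corr, size=n)  # assumed Gaussian design
    y = X @ beta_star + sigma * rng.standard_normal(n)       # assumed Gaussian noise
    return X, y, beta_star

rng = np.random.default_rng(0)
X, y, beta_star = simulate_example_c(p=100, n=30, sigma=3.0, rng=rng)
print(X.shape, y.shape, beta_star[:3])
```

Repeating the draw with fresh random numbers gives the replications over which prediction and estimation errors are summarized.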



Results. The performance of the estimator $\hat\beta$ (which can be the Lasso, the S-Lasso, the
Elastic-Net or the Fused-Lasso) in terms of the prediction error $\|y_{test} - X_{test}\hat\beta\|_n^2$ (on a test
set $(y_{test}, X_{test})$ of size n, that is, a set with the same size as the training set) and the $\ell_2$
estimation error $|\hat\beta - \beta^*|_2$ are illustrated by boxplots in Figures [1] to [4]. For some of these
experiments, the corresponding computational costs (in seconds) of each method are reported
in Table [1]. In what follows, we first compare the methods to each other in terms of their
accuracy. Then, we compare them in terms of their computational costs. Finally, we provide
some numerical justifications for the theoretical calibration of the tuning parameters of the
S-Lasso procedure.

Methods comparison in terms of performance: Let us consider the different examples sepa- 
rately. 

— Example (a): when we consider the procedures induced by the cross validation criterion 
(for the choice of the tuning parameter), we notice that none of them outperforms the others 
even when $\rho = 0.9$ (quite large correlation between successive variables). This is observed
Figure 2: Performance of the Lasso (L), the S-Lasso (SL), the Fused-Lasso (FL) and the Elastic-Net
(EN) applied to Example (b) and based on 500 replications. The tuning parameters are chosen
based on the theoretical study in the first two plots and by 10-fold cross validation in the third.
Left: Evaluation of the prediction error $\|y_{test} - X_{test}\hat\beta\|_n^2$, in comparison with the performance of
the truth (T), i.e., $\|y_{test} - X_{test}\beta^*\|_n^2$. Center and Right: Evaluation of the $\ell_2$ estimation error
$|\hat\beta - \beta^*|_2$.



for both prediction and estimation errors. This is essentially due to the good behavior of
the Lasso in such a situation where the regression vector is sparse but without any particular
structure. Actually, this conclusion holds in almost all the cases even when the tuning parameters
are chosen based on the theoretical study. However, two observations can be made.
First, when both $\rho$ and $\sigma$ are small, the Lasso estimator performs slightly better than the
other methods. Moreover, when $\rho$ is large a small improvement can be observed using the
Fused-Lasso, the Elastic-Net and the S-Lasso methods when we care about the estimation
error. This is illustrated in Figure [1] (left and right respectively) where we display the performance
of the methods in terms of the prediction error in Example (a) [1/0.1] (left) and in
terms of the estimation error in Example (a) [3/0.9] (right). For this example, the Lasso seems
to be the best method since it involves only one tuning parameter. It moreover has a lower
(mean) computational cost equal to 0.18 seconds (based on the cross validation criterion) as
displayed in Table [1]. The S-Lasso, the Elastic-Net and the Fused-Lasso computational costs
are respectively 3.7, 3.6 and 4.2 seconds.

— Example (b): together with Example (a), this is the least favorable example for the S-Lasso.
Indeed, here the first fifteen coefficients equal 3, and then the value of the coefficients drops
directly to 0: there is a breakpoint in the 'smoothness' of the true regression vector. Figure [5]
displays the best reconstitution of the regression vector $\beta^*$ using the S-Lasso solution (the one
which minimizes the $\ell_2$ estimation error, available since $\beta^*$ is known). We observe edge effects
(at the breakpoint in the 'smoothness') that the S-Lasso cannot handle because of the $\ell_2$ fusion
penalty term. However, even in this case, all the procedures seem to perform similarly when the
tuning parameters are chosen by cross validation. When the noise level is large ($\sigma = 15$), let us
nevertheless mention a (very) small improvement using the corrected versions of the S-Lasso
and the Elastic-Net. Figure [2] (right) illustrates the performance of the methods in terms
of the estimation error when they are applied to Example (b) [40/50/15]. The Fused-Lasso
slightly outperforms the other methods in this example (with $\sigma = 15$) as far as the
estimation performance is concerned.
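
The edge effect described for Example (b) can be made concrete by comparing the two kinds of fusion penalty on a step-shaped vector. The sketch below is only an illustration under stated assumptions: it takes the Fused-Lasso penalty to be the sum of absolute successive differences and the S-Lasso penalty to be the sum of squared successive differences, and the `ramp` vector is a hypothetical smooth alternative that does not appear in the simulation study.

```python
def l1_fusion(beta):
    """Fused-Lasso style penalty: sum of |beta[j] - beta[j-1]|."""
    return sum(abs(b - a) for a, b in zip(beta, beta[1:]))


def l2_fusion(beta):
    """S-Lasso style penalty (assumed form): sum of (beta[j] - beta[j-1])**2."""
    return sum((b - a) ** 2 for a, b in zip(beta, beta[1:]))


# Step profile in the spirit of Example (b): fifteen coefficients equal to 3, then zeros.
step = [3.0] * 15 + [0.0] * 25

# Hypothetical smooth alternative: a linear ramp from 3 down to 0 over fifteen steps.
ramp = [3.0] * 15 + [3.0 - 0.2 * k for k in range(1, 16)] + [0.0] * 10

print("step:", l1_fusion(step), l2_fusion(step))
print("ramp:", l1_fusion(ramp), l2_fusion(ramp))
```

Both vectors have the same total variation (3, up to rounding), so an absolute-difference penalty does not distinguish them, whereas the squared-difference penalty is 9 for the single jump and about 0.6 for the ramp. This is one way to read why a quadratic fusion term tends to smooth out an abrupt breakpoint.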

On the other hand, when the methods are based on the theoretical calibration of the tuning
parameters, two observations can be made regardless of the noise level ($1 \le \sigma \le 15$): the
S-Lasso and the Lasso perform better than the other methods in terms of the prediction error;
the S-Lasso and the Elastic-Net provide good results whereas the Lasso has poor performance
in terms of estimation error. This is illustrated in Figure [2] (left and center respectively) when
the methods are applied to Example (b) [40/50/3]. Note moreover that similar results are
also obtained when $p = 100$ and $n = 40$. In this case, the behavior of the different methods
seems to be stable with respect to the parameters $p$, $n$ and $\sigma$. This example is quite
interesting since it confirms that a good method for the prediction objective can be less
efficient for the estimation objective (see the performance of the Lasso and the Elastic-Net).

Figure 3: Evaluation of the $\ell_2$ estimation error $|\hat\beta - \beta^*|_2$ of the Lasso (L), the S-Lasso (SL), the
Fused-Lasso (FL) and the Elastic-Net (EN) applied to Example (c) [100/30/3] and based on 500 replications.
Left: The tuning parameters are chosen by 10-fold cross validation. Right: The tuning parameters
are chosen based on the theoretical study.

— Example (c): we consider several values of the sample size $n$ and the dimension $p$. It turns
out that here again, when $p < n$, all the methods behave in the same way when the tuning
parameters are chosen by cross validation (the S-Lasso induces just a small improvement).
However, when $p > n$ the S-Lasso is by far better than the other methods. This is illustrated by
Figure [3] (left), where the $\ell_2$ estimation error of each method applied to Example (c) [100/30/3]
is displayed. The same plot is obtained for the prediction error.

Moreover, when the tuning parameters are calibrated according to the theoretical study, the
S-Lasso performs the best and the Fused-Lasso the worst. This appears to be true whatever
the values of the parameters $p$, $n$ and $\sigma$. See for instance Figure [3] (right), where the different
methods are applied to Example (c) [100/30/3] for the estimation task (the same is
obtained for the prediction objective).

Note that in this example, the Fused-Lasso and the Elastic-Net appear to bring no benefit.

— Example (d): together with Example (c), this is the most favorable situation for the S-Lasso
estimator, since the regression vector is 'smooth' with a large number of non-zero components.
The S-Lasso estimator seems to dominate its competitors in all cases, regardless of the sample
size $n$, the dimension $p$, or the noise level $\sigma$. This observation holds for both the $\ell_2$ estimation
and the prediction errors. Note that when the tuning parameters are chosen by cross validation, the
Lasso, the Fused-Lasso and the Elastic-Net have quite close performance. Figure [4] illustrates
this fact when $p < n$ for the estimation error (left: cross validation; center-left: theory).
Moreover, Figure [4] (center-right and right) displays the performance of the methods when
$p > n$ in the case where the tuning parameters are based on the theoretical study (note that the
ranking of the methods does not change from the case $p < n$ when the tuning parameters are chosen
by cross validation). In addition, an interesting observation follows from the experiments on
Example (d) [100/30/3] (Figure [4]-left). Indeed, here the sparsity index is $|\mathcal{A}^*| = 40$, which is
larger than the sample size $n = 30$. In this case, the Lasso has poor performance. However,
the S-Lasso is still good. Moreover, there even exists a pair $(\lambda, \mu)$ (the pair minimizing the $\ell_2$
estimation error, available since $\beta^*$ is known) such that we have a good reconstitution of the
regression vector $\beta^*$ (see Figure [5]-right).

Figure 4: Evaluation of the $\ell_2$ estimation error $|\hat\beta - \beta^*|_2$ of the Lasso (L), the S-Lasso (SL), the
Fused-Lasso (FL) and the Elastic-Net (EN) applied to Example (d) and based on 500 replications.
Left: The tuning parameters are chosen by 10-fold cross validation. Center-left, Center-right and
Right: The tuning parameters are chosen based on the theoretical study.

Figure 5: Best reconstitution of the regression vector $\beta^*$ (black curve) by the S-Lasso
estimator (red curve). Left: Application to Example (b) [40/50/15]. Right: Application to
Example (d) [100/30/3].

Methods comparison in terms of computational costs: Table [1] displays the computational cost
(in seconds) of each method on several examples. First note that the Fused-Lasso has the
largest computational cost in all the simulations whereas the Lasso has the smallest. The
Elastic-Net and the S-Lasso have intermediate computational costs, which remain reasonable
compared to the Fused-Lasso. More precisely, when the tuning parameters are chosen by
cross validation, we remark that the computational costs for the S-Lasso and the Elastic-Net
are about 30 times larger than for the Lasso. This can partly be explained by the number
of values explored for the tuning parameter $\mu$ (a grid with 20 elements). Actually, since the
S-Lasso and the Elastic-Net are obtained with a Lasso program applied to expanded data (cf.
Lemma [1]), it turns out that even for fixed $\lambda$ and $\mu$ the computational cost of the Lasso is (a
bit) smaller than the computational costs of the S-Lasso and the Elastic-Net. This is observed
for example when we consider the solutions computed when the tuning parameters are chosen
based on the theoretical study. Except in Example (a), where the increase of computational
cost using the S-Lasso and the Elastic-Net is not justified (since the improvement over the
Lasso is quite small), in most of the considered situations it is quite interesting
to use the Elastic-Net and even more interesting to use the S-Lasso estimator. This is due to
the 'smoothness' of the true regression vector.

Table 1: Computational costs in seconds for the Lasso (L), the S-Lasso (SL), the Fused-Lasso
(FL) and the Elastic-Net (EN) in several examples illustrated in the above figures. We chose
either Tuning = Th or Tuning = Cv, depending on whether we consider the methods with
the tuning parameters based on the theoretical study or on the 10-fold cross validation.
The power of ten attached to the Th rows and to the last FL entry is illegible and is shown as 10^(?).

Meth.  Tuning        Ex.(a) [1/0.1]  Ex.(a) [3/0.9]  Ex.(b) [40/50/15]  Ex.(c) [30/50/3]  Ex.(d) [500/100/3]
L      Th · 10^(?)   1.1 ± 0.1       8 ± 41          5 ± 2              33 ± 64           457 ± 243
L      Cv            0.18 ± 0.01     0.5 ± 0.2       0.5 ± 0.1          1.1 ± 0.3         12.3 ± 4.9
SL     Th · 10^(?)   5.1 ± 6.4       8 ± 28          6 ± 6              48 ± 81           967 ± 441
SL     Cv            3.7 ± 0.1       11.1 ± 1.3      10.2 ± 2.0         36.2 ± 9.1        648.3 ± 219.2
FL     Th · 10^(?)   2.6 ± 0.3       10.0 ± 30.0     20 ± 12            518 ± 271         5996 ± 2019
FL     Cv            4.2 ± 0.2       14.1 ± 1.6      38.3 ± 5.8         245.6 ± 64.3      ≈ 3 · 10^(?)
EN     Th · 10^(?)   4.7 ± 3.5       9 ± 43          5 ± 3              41 ± 60           1022 ± 432
EN     Cv            3.6 ± 0.2       11.0 ± 1.3      10.2 ± 2.0         35.2 ± 8.9        637.3 ± 214.0


Finally, the Fused-Lasso has a large computational cost due to the $\ell_1$-fusion penalty, which
admits a singularity. Moreover, it does not significantly improve on the Lasso estimator in the
situations we considered in this paper (as observed in the previous part).
In view of the computational costs related to Example (a) (the first two columns in Table [1]),
let us finally remark that these costs increase with $\rho$, the correlation level between variables,
and $\sigma$, the noise level. We observe for example that the mean computational cost of the Lasso
estimator (when the tuning parameter is chosen by cross validation) is 1.1 seconds when
$\rho = 0.1$ and $\sigma = 1$ and increases to 8 seconds when $\rho = 0.9$ and $\sigma = 3$.
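
The statement in the computational-cost discussion that the S-Lasso can be computed with a Lasso program applied to expanded data can be checked numerically. The following is a minimal sketch, not the authors' implementation. It assumes a least-squares loss normalized by $1/n$, a quadratic penalty of the form $\mu \sum_j (\beta_j - \beta_{j-1})^2$, and a $(p-1) \times p$ first-difference matrix (the paper's own matrix $J$ is $p \times p$); all function names are hypothetical.

```python
import math
import random


def difference_matrix(p):
    """Assumed smoothing matrix: row k maps beta to beta[k+1] - beta[k]."""
    rows = []
    for k in range(p - 1):
        row = [0.0] * p
        row[k], row[k + 1] = -1.0, 1.0
        rows.append(row)
    return rows


def augment(X, y, J, mu):
    """Stack sqrt(n * mu) * J under X and a block of zeros under y."""
    scale = math.sqrt(len(X) * mu)
    X_aug = [row[:] for row in X] + [[scale * v for v in row] for row in J]
    y_aug = list(y) + [0.0] * len(J)
    return X_aug, y_aug


def scaled_rss(X, y, beta, n):
    """(1/n) times the squared Euclidean norm of y - X beta."""
    return sum(
        (t - sum(x * b for x, b in zip(row, beta))) ** 2 for row, t in zip(X, y)
    ) / n


rng = random.Random(0)
n, p, mu = 6, 4, 0.3
X = [[rng.gauss(0.0, 1.0) for _ in range(p)] for _ in range(n)]
y = [rng.gauss(0.0, 1.0) for _ in range(n)]
beta = [rng.gauss(0.0, 1.0) for _ in range(p)]

# Smooth part of the penalized criterion, computed directly.
fusion = sum((b - a) ** 2 for a, b in zip(beta, beta[1:]))
direct = scaled_rss(X, y, beta, n) + mu * fusion

# Same quantity obtained as a plain least-squares loss on the expanded data.
X_aug, y_aug = augment(X, y, difference_matrix(p), mu)
expanded = scaled_rss(X_aug, y_aug, beta, n)

print(abs(direct - expanded) < 1e-9)
```

Because the two losses agree for every $\beta$, adding the same $\ell_1$ penalty to both gives the same minimizer, so any Lasso solver run on the expanded data would return an estimator of this type. The expanded design has more rows than the original one, which is consistent with the slightly larger cost reported for fixed tuning parameters.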

S-Lasso; theory vs. cross validation: in what follows, we compare both versions of the
S-Lasso. That is, we compare the S-Lasso when the tuning parameters are chosen by cross
validation and when the tuning parameters are chosen based on the theoretical study:

• first, we compare these two methods in terms of their performance. Figure [6] summarizes
the comparison between the S-Lasso based on a theoretical choice of the tuning parameters
(denoted in this part by S-Lasso$^{Th}$) and the S-Lasso where the tuning parameters are based
on 10-fold cross validation (denoted here by S-Lasso$^{Cv}$). First we can observe that the
performances of S-Lasso$^{Th}$ and S-Lasso$^{Cv}$ are close. Moreover, given the results in the part
'Methods comparison in terms of performance', they both perform well. However,
it seems that S-Lasso$^{Cv}$ outperforms S-Lasso$^{Th}$ when we deal with the prediction task. This
seems quite intuitive since, by definition, the cross validation criterion attempts to provide
a good estimator for the prediction objective. Regarding the $\ell_2$ estimation goal, we cannot
conclude the superiority of one of the estimators over the other. Nevertheless, in the
high-dimensional setting of Example (d) [500/100/$\sigma$], it seems that S-Lasso$^{Cv}$ begins to become better.

At least, the theoretical choice for $\mu$ (cf. Corollary [1]) provides good performance both in terms of
$\ell_2$ estimation error and test error. They are often close to the performance of the S-Lasso
estimator based on the cross validation criterion. This is quite interesting since the computational
cost of S-Lasso$^{Th}$ is much smaller than that of S-Lasso$^{Cv}$. This study is actually more a verification of
our theoretical choices of the tuning parameters than a rule to apply in practice. Indeed, since
the theoretical choice of $\mu$ depends on $\beta^*$, the corresponding estimator S-Lasso$^{Th}$ is unusable
in real data problems;

• second, we evaluate the values of the tuning parameters in both cases. Table [2] displays the
values of the tuning parameters $(\lambda, \mu)$ of the S-Lasso, when they are chosen by cross validation,
$(\lambda^{Cv}, \mu^{Cv})$, and based on the theoretical values, $(\lambda^{Th}, \mu^{Th})$. We compare them to the values of
the parameters $(\lambda^{Est}, \mu^{Est})$ that minimize the $\ell_2$ estimation error.

A first remark is that the values of the tuning parameters calibrated based on the theoretical
study are always larger than those chosen by cross validation. This is not surprising since
the theoretical calibration of the tuning parameters is designed to capture smoothness with a large
value of $\mu_n$. It then turns out that the theoretical considerations lead to 'smoother' solutions
than the cross validation. Note however that $\lambda^{Th} > \lambda^{Cv}$ does not imply that the solution
based on the theoretical study is sparser, since a larger $\mu$ usually implies that the solution is
less sparse.

Regarding the best solution (where the tuning parameters minimize the $\ell_2$ estimation error),
there are two cases. When the true regression vector is not smooth, it seems that these 'best'
tuning parameters are closer to the ones chosen by cross validation. When the true regression
vector is smooth, they are closer to the tuning parameters calibrated based on theory. To sum
up, one can say that the best $\lambda$ is close to the one chosen by cross validation, whereas the best
$\mu$ is closer to the one based on theory;

Figure 6: Evaluation of the $\ell_2$ estimation error $|\hat\beta - \beta^*|_2$ (top) and the prediction error
$\|y_{test} - X_{test}\hat\beta\|_n$ (bottom) of the S-Lasso based on 500 replications. For each subplot: Left: The tuning
parameters are chosen by 10-fold cross validation. Right: The tuning parameters are chosen based
on the theoretical study. We refer to Table [2] for an evaluation of these tuning parameters.

• finally, we compare both methods in terms of their estimation accuracy of $J\beta^*$.
Table [3] summarizes the results. The first four rows display the median values of $|J\hat\beta|_2$
when $\hat\beta$ denotes the S-Lasso estimator. We compare the three ways to calibrate the tuning
parameters. We observe that the S-Lasso based on cross validation (S-Lasso$^{Cv}$) provides
satisfying estimations of $|J\beta^*|_2$. We also note that the S-Lasso based on the theoretical values
of the tuning parameters (S-Lasso$^{Th}$) is particularly good in Examples (c) and (d). This is
not surprising since the regression vector in these examples is smooth. It behaves similarly to
the best S-Lasso solution (in terms of the minimization of the $\ell_2$ estimation error).
Since $\lambda^{Th}$ and $\mu^{Th}$ depend on $|J\beta^*|_2$ and $|\mathcal{A}^*|$ (cf. Corollary [1]), one may try to use S-Lasso$^{Cv}$
to estimate these two quantities. In this way, one would be able to compute S-Lasso$^{Th}$ even in
real dataset experiments. However, our experiments reveal that S-Lasso$^{Cv}$ may overestimate
the number of nonzero components, as illustrated by the last four rows of Table [3] (this is also
a well-known fact). Nevertheless, we do not exclude this approach, which can be helpful to
obtain performance closer to that of S-Lasso$^{Th}$.

Table 2: Median values of the tuning parameters $(\lambda, \mu)$ of the S-Lasso for different ways of
calibration: 'Cv' for cross validation; 'Th' for theoretical choice; 'Est' for $\ell_2$ estimation error
minimizers. The tuning parameters displayed here correspond to the experiments illustrated
in Figure [6].

Tuning           Ex.(a) [3/0.9]   Ex.(b) [100/40/3]   Ex.(c) [100/30/3]   Ex.(d) [500/100/3]
$\lambda^{Cv}$   0.4              0.5                 0.3                 1.0
$\mu^{Cv}$       0.0005           0.0003              0.2                 0.1
$\lambda^{Th}$   2.7              2.8                 1.1                 2.1
$\mu^{Th}$       0.5110           1.3100              0.4                 1.2
$\lambda^{Est}$  0.7              1.0                 0.3                 1.0
$\mu^{Est}$      0.2500           1.2500              0.3                 2.0

Table 3: Median value of $|J\hat\beta|_2$ (first four rows) and median number of nonzero components
$|\hat{\mathcal{A}}|$ (last four rows) of the S-Lasso for different ways of calibration of the tuning parameters:
'Cv' for cross validation; 'Th' for theoretical choice; 'Est' for $\ell_2$ estimation error minimizers.
The third quartiles are displayed in brackets. The values in this table correspond to the
experiments illustrated in Figure [6] (and in Table [2] as well).

Tuning                       Ex.(a) [3/0.9]   Ex.(b) [100/40/3]   Ex.(c) [100/30/3]   Ex.(d) [500/100/3]
$|J\beta^*|_2$               3.5              3                   2.4                 2.8
$|J\hat\beta^{Cv}|_2$        4.4 [6.0]        4.7 [6.0]           2.5 [2.7]           4.0 [4.4]
$|J\hat\beta^{Th}|_2$        0.9 [1.1]        1.8 [1.8]           2.3 [2.4]           2.9 [2.9]
$|J\hat\beta^{Est}|_2$       1.8 [2.7]        1.8 [2.6]           2.3 [2.4]           2.7 [2.8]
$|\mathcal{A}^*|$            3                15                  15                  40
$|\hat{\mathcal{A}}^{Cv}|$   5 [7]            35 [41]             29 [33]             74 [82]
$|\hat{\mathcal{A}}^{Th}|$   6 [7]            17 [19]             18 [21]             53 [57]
$|\hat{\mathcal{A}}^{Est}|$  6 [7]            40 [58]             33 [37]             102 [113]
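
The plug-in idea discussed above (estimating the sparsity index and the smoothness of the regression vector from a cross-validated fit) only requires two summary statistics of an estimate. The sketch below shows how they could be computed; it assumes that $J$ acts as a first-difference operator, and the vector `beta_cv` is a made-up example rather than an output of the experiments.

```python
import math


def support_size(beta_hat, tol=1e-10):
    """Number of components whose absolute value exceeds a small tolerance."""
    return sum(1 for b in beta_hat if abs(b) > tol)


def fusion_norm(beta_hat):
    """Euclidean norm of the successive differences of beta_hat."""
    return math.sqrt(sum((b - a) ** 2 for a, b in zip(beta_hat, beta_hat[1:])))


# Hypothetical cross-validated estimate, for illustration only.
beta_cv = [0.0, 0.0, 1.0, 2.0, 2.0, 1.0, 0.0, 0.0]

print(support_size(beta_cv), fusion_norm(beta_cv))
```

As noted in the text, a cross-validated fit may overestimate the number of nonzero components, so a plug-in value of this kind should be read as a rough guide rather than as a reliable estimate.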

Conclusion of the experimental results. The S-Lasso has good performance when the
regression vector is 'smooth' (Examples (c) and (d)). Nevertheless, even in situations designed in
favor of the Elastic-Net and the Fused-Lasso (Example (b)), the S-Lasso performs similarly
to the other methods when the tuning parameters are chosen based on the cross validation
criterion. The S-Lasso is even better in these examples when the methods are constructed
based on the theoretical considerations.

All the results for the procedures whose tuning parameters are chosen based
on the theoretical considerations are a little unfair to the Fused-Lasso. Indeed, the
rates of the tuning parameters have been calibrated based on a study made for the estimator
$\hat\beta^{Quad}$ (the Elastic-Net and the S-Lasso are two particular cases of this estimator). For the
Lasso estimator, we also used the usual rate for $\lambda$. Even if the Fused-Lasso seems to be close
to the S-Lasso, it turns out that similar choices for the tuning parameters lead to worse results
for the Fused-Lasso.

Based on the results for Examples (c) and (d), it seems that the Fused-Lasso and the Elastic-Net
induce a large bias for large values of $\mu$ when the regression vector is smooth (also observed
in [10]). They do not significantly improve the performance of the Lasso estimator in such
situations. Even the 'corrected' Elastic-Net does not provide better results since the artificial
correction seems to work only for a small number of pairs $(\lambda, \mu)$ that have to be chosen very
carefully.

One can think of two-stage methods to obtain better performance for the Fused-Lasso and
the Elastic-Net (and also for the S-Lasso and the Lasso), where for instance an ordinary least
squares estimator is fitted on the estimated support. This technique of course reduces the bias of
the procedures, and we refer to [2] for a nice theoretical study of such procedures. However,
we attempt here to examine the performance of the (one-stage) methods and to observe how
well the S-Lasso approaches the true regression vector.
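
For readers who want to see what the two-stage idea mentioned above looks like in practice, here is a minimal sketch of an ordinary least squares refit on an estimated support. It is not part of the study, which deliberately evaluates one-stage methods; the helper names, the toy data and the shrunken first-stage estimate are all hypothetical, and the refit assumes that the selected columns are linearly independent.

```python
def solve_linear_system(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    m = len(A)
    M = [row[:] + [rhs] for row, rhs in zip(A, b)]
    for col in range(m):
        pivot = max(range(col, m), key=lambda r: abs(M[r][col]))
        if abs(M[pivot][col]) < 1e-12:
            raise ValueError("singular system: collinear or too many selected columns")
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, m):
            factor = M[r][col] / M[col][col]
            for c in range(col, m + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * m
    for r in range(m - 1, -1, -1):
        tail = sum(M[r][c] * x[c] for c in range(r + 1, m))
        x[r] = (M[r][m] - tail) / M[r][r]
    return x


def ols_refit(X, y, beta_hat, tol=1e-10):
    """Refit by least squares on the support of a first-stage estimate."""
    support = [j for j, b in enumerate(beta_hat) if abs(b) > tol]
    gram = [[sum(row[j] * row[k] for row in X) for k in support] for j in support]
    rhs = [sum(row[j] * t for row, t in zip(X, y)) for j in support]
    coef = solve_linear_system(gram, rhs) if support else []
    refit = [0.0] * len(beta_hat)
    for j, c in zip(support, coef):
        refit[j] = c
    return refit


# Toy noiseless data generated from beta = (2, 0, -1).
X = [[1.0, 0.0, 2.0], [0.0, 1.0, 1.0], [1.0, 1.0, 0.0], [2.0, 0.0, 1.0]]
y = [0.0, -1.0, 2.0, 3.0]
shrunken = [1.5, 0.0, -0.6]  # hypothetical biased first-stage estimate

print(ols_refit(X, y, shrunken))
```

In this noiseless toy case the refit recovers the coefficients on the selected support, which illustrates the bias reduction mentioned in the text. The refit cannot repair a wrongly estimated support.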



4.2 Pseudo-real dataset 

We apply all the methods we previously studied to artificially generated datasets built from the
riboflavin data. These data concern riboflavin (vitamin B2) production by Bacillus subtilis.
They have kindly been provided to us by DSM Nutritional Products (Switzerland). In the
original data, the real-valued response variable is the logarithm of the riboflavin production
rate, and there are $p = 4088$ covariates measuring the logarithm of the expression level of 4088
genes that cover essentially the whole genome of Bacillus subtilis. The sample size is $n = 71$.
Here, we are not interested in the riboflavin production, but only in the covariates matrix $X$
coming from this application. We use this design matrix to generate an artificial response
vector with a 'smooth' regression vector as in Equation (1). Let us mention that this trick to
generate pseudo-real datasets has already been used in [22]. In what follows, we consider two
different applications based on the real covariates matrix provided by the riboflavin dataset.
In the first application, say Application 1, let us define $X$ as the first 1023 covariates of
the riboflavin dataset. Moreover, let us define the regression vector $\beta^*$ such that
$\beta^*_j = 10 \cdot \exp\left(-\frac{1}{1 - ((j-125)/125)^2}\right)$ for $j = 1, \dots, 250$ (cf. Figure [8]) and the noise level $\sigma = 3$. Hence,
$n = 71$ and $p = 1023$, so this is a high-dimensional setting with $p \gg n$ where the
number of non-zero components (the sparsity index $|\mathcal{A}^*|$) is larger than the sample size $n$.
In the second application, say Application 2, we restrict $X$ to the first 300 covariates
of the riboflavin dataset. The regression vector $\beta^*$ is such that
$\beta^*_j = 10 \cdot \exp\left(-\frac{1}{1 - ((j-25)/25)^2}\right)$ for $j = 1, \dots, 50$ (cf. Figure [8]), and the noise level is $\sigma = 3$.
This is a more common high-dimensional case where the sparsity index $|\mathcal{A}^*|$ is smaller than the sample size $n$.
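
The pseudo-real construction keeps a real design matrix fixed and simulates only the response. A minimal sketch of that step is given below, assuming the linear model $Y = X\beta^* + \varepsilon$ with independent Gaussian noise of standard deviation $\sigma$. The function name, the tiny design matrix and the coefficient vector are placeholders, not the riboflavin data.

```python
import random


def simulate_response(X, beta_star, sigma, seed=0):
    """Draw y = X beta_star + sigma * standard Gaussian noise for a fixed design X."""
    rng = random.Random(seed)
    return [
        sum(x * b for x, b in zip(row, beta_star)) + sigma * rng.gauss(0.0, 1.0)
        for row in X
    ]


# Placeholder design and coefficients (the study uses the riboflavin covariates).
X = [[1.0, 0.0, 2.0], [0.5, 1.0, -1.0]]
beta_star = [1.0, 2.0, 0.5]

y_noiseless = simulate_response(X, beta_star, sigma=0.0)
y_noisy = simulate_response(X, beta_star, sigma=3.0, seed=1)
print(y_noiseless, y_noisy)
```

Repeating the call with different seeds gives independent replications on the same design, in the spirit of the 20 replications used for Application 2.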

Let us now detail the results obtained in the different experiments. First, we mention that,
with the exception of the S-Lasso, all the methods provide an estimation of the regression
vector which is characterized by large variations in the values of the successive components
when $\mu$ is small (for the Elastic-Net and the Fused-Lasso) and by a large bias when $\mu$ is large.
Hence, we focus here on the S-Lasso estimator. Nevertheless, we display the comparison of all
the methods in terms of accuracy in Figure [7] when the methods are applied to Application 2.
Even though the S-Lasso estimator is outperformed when the tuning parameter is chosen by
cross validation (by the Fused-Lasso for the estimation error and by all the methods for the
prediction; cf. Figure [7] (left and center-left)), it turns out that we can find an S-Lasso
solution which performs better than the other methods, as displayed in Figure [7] (center-right and
right). One of the best solutions of the S-Lasso estimator in Application 2 can also be seen in
Figure [8] (left). We observe how the S-Lasso succeeds in reconstructing the 'smooth' regression
vector $\beta^*$. Before considering Application 1, we point out one more fact: in both the center-right
and right plots in Figure [7], the tuning parameters minimize the $\ell_2$ estimation error. This
can explain the bad performance of the Lasso when we consider the
prediction error (right plot). This also explains the big discrepancy between the Lasso based
on cross validation (center-left plot) and the one corresponding to the right plot.
Finally, let us consider Application 1, and let us recall that the sparsity index is here larger
than the sample size. Figure [8] (right) displays the best reconstitution of the regression vector
on this very difficult problem. We observe that the S-Lasso succeeds only partly in
reconstructing the true regression vector. In the simulation study, we met a similar situation with
Example (d) [100/30/3] (cf. Figure [5]), where the S-Lasso perfectly estimated $\beta^*$. However,
the situation here is even more difficult since the sparsity index is much larger than the
sample size and since many high and negative correlations between the covariates appear in the
riboflavin dataset.

Figure 7: Evaluation of the $\ell_2$ estimation error $|\hat\beta - \beta^*|_2$ and the prediction error $\|y_{test} - X_{test}\hat\beta\|_n$
of the Lasso (L), the S-Lasso (SL), the Fused-Lasso (FL) and the Elastic-Net (EN) applied to the
pseudo-real data, and based on 20 replications of Application 2. Left and Center-left: The tuning
parameters are chosen by 10-fold cross validation. Center-right and Right: The tuning parameters
minimize the $\ell_2$ estimation error.

Figure 8: Best reconstitution of the regression vector $\beta^*$ (black curve) by the S-Lasso estimator
(red curve). Left: On Application 2. Right: On Application 1.



5 Conclusion 

In this paper, we introduced the Lasso-type estimator $\hat\beta^{Quad}$, which consists of two penalty
terms: an $\ell_1$ penalty term which ensures sparsity and a quadratic penalty term which
captures some structure in the regression vector. We showed that this estimator satisfies good
theoretical properties, specifically when the Lasso estimator might fail. As special cases we
considered the Elastic-Net and the S-Lasso. These methods are interesting in particular when
correlations between variables exist or when the regression vector is 'smooth'. We illustrated
this in a certain setting and an example where $\hat\beta^{Quad}$ performs better than the Lasso.
In a simulation study, we considered the performance of the S-Lasso estimator compared to the
Lasso, the Elastic-Net and the Fused-Lasso in terms of prediction and estimation accuracy.
We found the S-Lasso to be superior in several simulation experiments where the regression
vector has a particular structure. We also observed that the theoretical calibration of the
tuning parameters and the one obtained by 10-fold cross validation provide similar performance.
The methods have also been applied to pseudo-real examples based on the riboflavin dataset.
Finally, we pointed out in several simulation studies (see Example (d) [100/30/$\sigma$]) the ability
of the S-Lasso to recover a smooth vector even in difficult situations where the sparsity index is
larger than the sample size.



6 Proofs 

We first provide two concentration results: the first one deals with Gaussian noise and the 
second one concerns noise admitting finite variance. 

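Lemma 2. Let $\eta \in (0,1)$. Let $0 < \tau \le 1$ be a real number. Let $\Lambda_{n,p}$ be the random
event defined by $\Lambda_{n,p} = \{\max_{j=1,\dots,p} 2|V_j| \le \tau\lambda_n\}$ where $V_j = n^{-1}\sum_{i=1}^n x_{i,j}\varepsilon_i$. We define
$\lambda_n = \frac{2\sqrt{2}}{\tau}\,\sigma\sqrt{n^{-1}\log(p/\eta)}$. Then

$$P\Big(\max_{j=1,\dots,p} 2|V_j| \le \tau\lambda_n\Big) \ge 1 - \eta.$$

Proof. Since $V_j \sim \mathcal{N}(0, n^{-1}\sigma^2)$ for any $j \in \{1,\dots,p\}$, an elementary Gaussian inequality gives

$$P\Big(\max_{j=1,\dots,p} |V_j| \ge \tau\lambda_n/2\Big) \le p \max_{j=1,\dots,p} P\big(|V_j| \ge \tau\lambda_n/2\big) \le p\exp\Big(-\frac{n}{2\sigma^2}\Big(\frac{\tau\lambda_n}{2}\Big)^2\Big) = \eta.$$

This ends the proof. □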
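
The union-bound argument of this proof can be checked numerically. The sketch below is an illustration under explicit assumptions: each $V_j$ is taken to be centered Gaussian with variance $\sigma^2/n$, the tail bound $P(|V_j| \ge t) \le \exp(-n t^2 / (2\sigma^2))$ is used, and the level $\lambda_n$ is obtained by solving $p \exp(-n(\tau\lambda_n/2)^2/(2\sigma^2)) = \eta$. The function names are hypothetical, and the numerical values only borrow the dimensions of the riboflavin example.

```python
import math


def union_bound(p, n, sigma, t):
    """Upper bound p * exp(-n t^2 / (2 sigma^2)) on P(max_j |V_j| >= t)."""
    return p * math.exp(-n * t * t / (2.0 * sigma ** 2))


def lambda_for_level(p, n, sigma, eta, tau):
    """Value of lambda_n for which the union bound at t = tau * lambda_n / 2 equals eta."""
    t = sigma * math.sqrt(2.0 * math.log(p / eta) / n)
    return 2.0 * t / tau


p, n, sigma, eta, tau = 4088, 71, 3.0, 0.05, 0.5
lam = lambda_for_level(p, n, sigma, eta, tau)
print(lam, union_bound(p, n, sigma, tau * lam / 2.0))
```

The second printed value reproduces $\eta$ up to rounding, and the computed level grows like $\sqrt{\log(p/\eta)/n}$, the usual rate for the Lasso tuning parameter.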



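Lemma 3. Let $\eta \in (0,1)$. Let $0 < \tau \le 1$ be a real number. Denote also by $L$ the constant
such that $n^{-1}\sum_{i=1}^n \max_{j=1,\dots,p} x_{i,j}^2 \le L$. Let $\Lambda_{n,p}$ be the random event defined by
$\Lambda_{n,p} = \{\max_{j=1,\dots,p} 2|V_j| \le \tau\lambda_n\}$ where $V_j = n^{-1}\sum_{i=1}^n x_{i,j}\varepsilon_i$, such that for any $i = 1,\dots,n$, $x_{i,j}^2 \le L$
and the $\varepsilon_i$'s are independent random variables with zero mean and finite variance $\mathbb{E}\varepsilon_i^2 \le \sigma^2$.
Denote by $K_{Nem}$ the quantity $K_{Nem} = \inf_{q \in [2,\infty) \cap \mathbb{R}} (q-1)\,p^{2/q}$. Then for
$\lambda_n = \frac{2\sigma}{\tau}\sqrt{\frac{K_{Nem} L}{n\eta}}$ we have

$$P\Big(\max_{j=1,\dots,p} 2|V_j| \le \tau\lambda_n\Big) \ge 1 - \eta.$$

Proof. This inequality uses an inequality on the expectation of the supremum of the square of a sum of
independent random variables that can be found in [11, Theorem 2.2]. Let us mention that
$2e\log(p) - 3e \le K_{Nem} \le 2e\log(p) - e$. The Markov inequality and Theorem 2.2 in [11] (with
$r = \infty$) imply

$$P\Big(\max_{j=1,\dots,p} |V_j| \ge \tau\lambda_n/2\Big) \le \frac{4}{\tau^2\lambda_n^2}\,\mathbb{E}\Big[\max_{j=1,\dots,p} V_j^2\Big]$$
$$\le \frac{4 K_{Nem}}{\tau^2\lambda_n^2 n^2}\sum_{i=1}^n \mathbb{E}\Big[\max_{j=1,\dots,p} x_{i,j}^2\varepsilon_i^2\Big] \qquad (15)$$
$$\le \frac{4 K_{Nem}\sigma^2}{\tau^2\lambda_n^2}\, n^{-2}\sum_{i=1}^n \max_{j=1,\dots,p} x_{i,j}^2 \le \eta,$$

where we used the definition of $\lambda_n = \frac{2\sigma}{\tau}\sqrt{\frac{K_{Nem} L}{n\eta}}$ in the last inequality. Theorem 2.2 in [11]
is used to obtain (15). □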
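
The constant $K_{Nem}$ has no closed form, but it is easy to approximate. The sketch below evaluates $(q-1)\,p^{2/q}$ on a fine grid of $q \ge 2$ and compares the minimum with the two logarithmic bounds quoted in the proof. The grid search, its step size and its upper limit are arbitrary choices made for illustration, and the function name is hypothetical.

```python
import math


def k_nem(p, q_max=200.0, step=1e-3):
    """Grid approximation of inf over q >= 2 of (q - 1) * p ** (2 / q)."""
    n_steps = int((q_max - 2.0) / step)
    best = float("inf")
    for k in range(n_steps + 1):
        q = 2.0 + k * step
        best = min(best, (q - 1.0) * p ** (2.0 / q))
    return best


for p in (100, 4088):
    value = k_nem(p)
    lower = 2.0 * math.e * math.log(p) - 3.0 * math.e
    upper = 2.0 * math.e * math.log(p) - math.e
    print(p, lower, value, upper)
```

For both dimensions the approximate value lies between the two bounds and close to the upper one. The constant therefore grows only logarithmically in $p$, although the resulting $\lambda_n$ carries a factor $\eta^{-1/2}$ instead of the $\sqrt{\log(1/\eta)}$ of the Gaussian case.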



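Proof of Theorem [1]. We provide a first result which may help the readability of the paper.
It states that the squared risk and the $\ell_1$-estimation error are controlled by the restricted
$\ell_2$-estimation error $|\beta^*_B - \hat\beta^{Quad}_B|_2$.

Proposition 3. Let $\hat\beta^{Quad}$ be the estimator under study, with tuning parameters $\lambda_n$ and
$\mu_n$. Let $0 < \tau \le 1$ be a real number. On the event $\Lambda_{n,p} = \{\max_{j=1,\dots,p} 2|V_j| \le \tau\lambda_n\}$ with
$V_j = n^{-1}\sum_{i=1}^n x_{i,j}\varepsilon_i$, if $\tau = 1/2$ we have

$$\|X\beta^* - X\hat\beta^{Quad}\|_n^2 + \frac{\lambda_n}{2}\,|\beta^* - \hat\beta^{Quad}|_1 \le r_n\,|\beta^*_B - \hat\beta^{Quad}_B|_2, \qquad (16)$$

where $r_n = 2\lambda_n\sqrt{|\mathcal{A}^*|} + 2\mu_n|J\beta^*|_2$, and $B$ is a set including $\mathcal{A}^*$.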
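Proof. Let first $\tilde X$, $\tilde Y$ and $\tilde\varepsilon$ be the augmented dataset defined by

$$\tilde X = \begin{pmatrix} X \\ \sqrt{n\mu_n}\,J \end{pmatrix}, \qquad \tilde Y = \begin{pmatrix} Y \\ 0_p \end{pmatrix} \qquad \text{and} \qquad \tilde\varepsilon = \begin{pmatrix} \varepsilon \\ -\sqrt{n\mu_n}\,J\beta^* \end{pmatrix},$$

where $0_p$ is a vector of size $p$ containing only zeros and $J$ is the $p \times p$ matrix given by (5). Then
we have $\tilde Y = \tilde X\beta^* + \tilde\varepsilon$, and the estimator $\hat\beta^{Quad}$, solution of the original penalized
minimization problem, is also the minimizer of

$$\frac{1}{n}\,|\tilde Y - \tilde X\beta|_2^2 + \lambda_n|\beta|_1.$$

Hence, by definition of the estimator $\hat\beta^{Quad}$ we can write

$$\frac{1}{n}\,|\tilde Y - \tilde X\hat\beta^{Quad}|_2^2 + \lambda_n|\hat\beta^{Quad}|_1 \le \frac{1}{n}\,|\tilde\varepsilon|_2^2 + \lambda_n|\beta^*|_1,$$

that is,

$$\frac{1}{n}\,|\tilde X\beta^* - \tilde X\hat\beta^{Quad}|_2^2 \le \lambda_n\big(|\beta^*|_1 - |\hat\beta^{Quad}|_1\big) + \frac{2}{n}\,\tilde\varepsilon'\tilde X(\hat\beta^{Quad} - \beta^*).$$

Let us now consider the term $\frac{2}{n}\tilde\varepsilon'\tilde X(\hat\beta^{Quad} - \beta^*)$. By the definition of $\tilde X$ and $\tilde\varepsilon$, we have the
decomposition $\frac{2}{n}\tilde\varepsilon'\tilde X(\beta^* - \hat\beta^{Quad}) = \frac{2}{n}\varepsilon'X(\beta^* - \hat\beta^{Quad}) - 2\mu_n\beta^{*\prime}J'J(\beta^* - \hat\beta^{Quad})$. The first term
in this decomposition is quite common in the literature and we treat it using arguments which
can be found for instance in [7]. We then need to adapt those arguments in order to deal
with the second term of the decomposition, $\mu_n\beta^{*\prime}J'J(\beta^* - \hat\beta^{Quad})$, at the same time. Recall
that $\mathcal{A}^* = \{j : \beta^*_j \ne 0\}$ and that $J'J = \tilde J$. Let $0 < \tau \le 1$ be a real number. Then, on the
event $\Lambda_{n,p} = \{\max_{j=1,\dots,p} 2|V_j| \le \tau\lambda_n\}$ with $V_j = n^{-1}\sum_{i=1}^n x_{i,j}\varepsilon_i$, we have

$$\|X\beta^* - X\hat\beta^{Quad}\|_n^2 \le \lambda_n\big(|\beta^*|_1 - |\hat\beta^{Quad}|_1\big) + \tau\lambda_n|\beta^* - \hat\beta^{Quad}|_1 - 2\mu_n\beta^{*\prime}\tilde J(\hat\beta^{Quad} - \beta^*). \qquad (17)$$

The remainder of this proof is linked to the way we choose to treat the term $\mu_n\beta^{*\prime}\tilde J(\beta^* - \hat\beta^{Quad})$,
and in particular to the way we choose to link the RHS of Inequality (17) to the quantity
$|\beta^*_{\mathcal{A}^*} - \hat\beta^{Quad}_{\mathcal{A}^*}|_2$. We obviously can write

$$\mu_n\beta^{*\prime}\tilde J(\beta^* - \hat\beta^{Quad}) = \mu_n\beta^{*\prime}\tilde J(\beta^*_B - \hat\beta^{Quad}_B) \le \mu_n|J\beta^*|_2\,|\beta^*_B - \hat\beta^{Quad}_B|_2,$$

where $B$ is the smallest set of indices such that the first equality holds. Note that the set $B$
includes $\mathcal{A}^*$, the true sparsity set, and is not much larger due to the sparsity of $J$.

Now let $\tau = 1/2$ in (17) and add $2^{-1}\lambda_n|\beta^* - \hat\beta^{Quad}|_1$ to both sides of this inequality. We then get

$$\|X\beta^* - X\hat\beta^{Quad}\|_n^2 + \frac{\lambda_n}{2}\,|\beta^* - \hat\beta^{Quad}|_1 \le \lambda_n\big(|\beta^*|_1 - |\hat\beta^{Quad}|_1 + |\beta^* - \hat\beta^{Quad}|_1\big) + 2\mu_n|J\beta^*|_2\,|\beta^*_B - \hat\beta^{Quad}_B|_2 \qquad (18)$$
$$\le 2\lambda_n|\beta^*_{\mathcal{A}^*} - \hat\beta^{Quad}_{\mathcal{A}^*}|_1 + 2\mu_n|J\beta^*|_2\,|\beta^*_B - \hat\beta^{Quad}_B|_2$$
$$\le r_n\,|\beta^*_B - \hat\beta^{Quad}_B|_2,$$

where $r_n = 2\lambda_n\sqrt{|\mathcal{A}^*|} + 2\mu_n|J\beta^*|_2$, since
$|\beta^*_{\mathcal{A}^*} - \hat\beta^{Quad}_{\mathcal{A}^*}|_1 \le \sqrt{|\mathcal{A}^*|}\,|\beta^*_{\mathcal{A}^*} - \hat\beta^{Quad}_{\mathcal{A}^*}|_2 \le \sqrt{|\mathcal{A}^*|}\,|\beta^*_B - \hat\beta^{Quad}_B|_2$.
In the second inequality above, we used the fact that $|\beta^*_j - \hat\beta^{Quad}_j| + |\beta^*_j| - |\hat\beta^{Quad}_j| = 0$
for any $j \notin \mathcal{A}^*$, together with the triangle inequality. This is the claim of Proposition [3] when $J$ is
sparse. □
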



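Let us now prove the main theorem. Thanks to Inequality (16) in Proposition [3], we easily
obtain that

$$|\beta^* - \hat\beta^{Quad}|_1 \le \varrho_n\,|\beta^*_B - \hat\beta^{Quad}_B|_2, \qquad (19)$$

where $\varrho_n := 2r_n/\lambda_n = 4\sqrt{|\mathcal{A}^*|} + \frac{4\mu_n}{\lambda_n}|J\beta^*|_2$. Then the vector $\beta^* - \hat\beta^{Quad}$ is an admissible
vector $\Delta$ in Assumption B(B). As a consequence, using this assumption in Equation (16), we
get on one hand

$$\|X\beta^* - X\hat\beta^{Quad}\|_n^2 \le \frac{r_n}{\sqrt{\phi_{\mu_n}}}\,\|X\beta^* - X\hat\beta^{Quad}\|_n,$$

and a simplification leads to the first part of the result

$$\|X\beta^* - X\hat\beta^{Quad}\|_n^2 \le \phi_{\mu_n}^{-1}\big(2\lambda_n\sqrt{|\mathcal{A}^*|} + 2\mu_n|J\beta^*|_2\big)^2.$$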



(20)
On the other hand, Inequality (19), combined with Assumption B(B) and Inequality (20), implies
the corresponding bound on $|\beta^* - \hat\beta^{Quad}|_1$,
which is the desired bound on the $\ell_1$ estimation error given in Theorem [1]. The proof is
completed when we use Lemma [2] with $\tau = 1/2$ to control the probability of the event $\Lambda_{n,p}$. □



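Proof of Proposition [1]. We first provide a bound on $|\beta^*_\Theta - \hat\beta^{Quad}_\Theta|_2$ for $\Theta = B \cup C$. Theorem [1]
states bounds on the prediction error and on the $\ell_1$ estimation error under Assumption B(B).
Here we do not care about the $\ell_1$ estimation error. Then one can observe that in the
intermediate step between (17) and (18) in the previous proof, one can avoid the addition of the term
$2^{-1}\lambda_n|\beta^* - \hat\beta^{Quad}|_1$. As a consequence, we obtain (20) but with $\tau = 1$ instead of $1/2$ in (17).
Apart from this value of $\tau$ everything remains the same.

More particularly, thanks to (19) we can use Assumption B($B \cup C$), which directly implies
that the following inequality holds: $|\beta^*_\Theta - \hat\beta^{Quad}_\Theta|_2 \le \phi_{\mu_n}^{-1/2}\,\|X\beta^* - X\hat\beta^{Quad}\|_n$, with $\Theta = B \cup C$.
Combining this inequality with (20), we easily get

$$|\beta^*_\Theta - \hat\beta^{Quad}_\Theta|_2 \le 2\phi_{\mu_n}^{-1}\big(\lambda_n\sqrt{|\mathcal{A}^*|} + \mu_n|J\beta^*|_2\big), \qquad (21)$$

with $\Theta = B \cup C$. Now, we consider the term $|\beta^*_{\Theta^c} - \hat\beta^{Quad}_{\Theta^c}|_2$. Denote by $\delta$ the vector
$\delta = \beta^* - \hat\beta^{Quad}$ for short. For any $p$-dimensional vector $a$, let $|a|_{(1)} \ge |a|_{(2)} \ge \dots \ge |a|_{(p)}$ be the
corresponding ranked sequence of absolute values. Given this new notation, note that for any $j$ the
inequality $|\delta_{B^c}|_{(j)} \le |\delta_{B^c}|_1 / j$ holds. As a consequence

$$|\delta_{\Theta^c}|_2^2 \le |\delta_{B^c}|_1^2 \sum_{j \ge m+1} \frac{1}{j^2} \le \frac{|\delta_{B^c}|_1^2}{m},$$

where we recall that $\Theta = B \cup C$, with $|C| = m$. Then using the last display with (19) yields

$$|\delta_{\Theta^c}|_2 \le \frac{\varrho_n}{\sqrt{m}}\,|\delta_B|_2 \le \frac{\varrho_n}{\sqrt{m}}\,|\delta_\Theta|_2,$$

where $\varrho_n = 4\sqrt{|\mathcal{A}^*|} + \frac{4\mu_n}{\lambda_n}|J\beta^*|_2$. Combining this last inequality with (21) implies

$$|\delta|_2 \le \Big(1 + \frac{\varrho_n}{\sqrt{m}}\Big)|\delta_\Theta|_2 \le 2\phi_{\mu_n}^{-1}\Big(1 + \frac{\varrho_n}{\sqrt{m}}\Big)\big(\lambda_n\sqrt{|\mathcal{A}^*|} + \mu_n|J\beta^*|_2\big).$$

Since $|\delta|_\infty \le |\delta|_2$, we obtain the desired control on the sup-norm of $\beta^* - \hat\beta^{Quad}$. □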



Proof of Theorem [2]. This result is quite natural since it is a direct consequence of
Proposition [1]. We refer the reader to the proof of Theorem 2 in [18] for instance. □

Proof of Theorem [3]. We consider now the case of general matrices $J$. Most of the proof is
similar to the sparse case (Proof of Theorem [1] above). The same reasoning leads to (17) and
the only difference occurs when we deal with the term $-2\mu_n\beta^{*\prime}\tilde J(\hat\beta^{Quad} - \beta^*)$. We have here

$$2\mu_n\beta^{*\prime}\tilde J(\beta^* - \hat\beta^{Quad}) \le 2\mu_n|\tilde J\beta^*|_\infty\,|\beta^* - \hat\beta^{Quad}|_1.$$

Then, if we set $\tau = 1/4$ and the tuning parameter $\mu_n = \frac{\lambda_n}{8|\tilde J\beta^*|_\infty}$, Inequality (17) becomes

$$\|X\beta^* - X\hat\beta^{Quad}\|_n^2 \le \lambda_n\big(|\beta^*|_1 - |\hat\beta^{Quad}|_1\big) + \frac{\lambda_n}{2}\,|\beta^* - \hat\beta^{Quad}|_1.$$

Add $2^{-1}\lambda_n|\beta^* - \hat\beta^{Quad}|_1$ to both sides of the previous inequality. Then, thanks to the fact
that $|\beta^*_j - \hat\beta^{Quad}_j| + |\beta^*_j| - |\hat\beta^{Quad}_j| = 0$ for any $j \notin \mathcal{A}^*$ and to the triangle inequality, the
above inequality implies that (we refer to the proof of Proposition [3] for similar arguments)

$$\|X\beta^* - X\hat\beta^{Quad}\|_n^2 + \frac{\lambda_n}{2}\,|\beta^* - \hat\beta^{Quad}|_1 \le 2\lambda_n\sqrt{|\mathcal{A}^*|}\,|\beta^*_{\mathcal{A}^*} - \hat\beta^{Quad}_{\mathcal{A}^*}|_2.$$

This intermediate result is the analogue of Proposition [3] in the case where $J$ is general.
That is, we get a similar bound, but depending on $|\beta^*_{\mathcal{A}^*} - \hat\beta^{Quad}_{\mathcal{A}^*}|_2$ instead of $|\beta^*_B - \hat\beta^{Quad}_B|_2$ and
with $r_n = 2\lambda_n\sqrt{|\mathcal{A}^*|}$. Note also that (19) is replaced by the following linear inequality:
$|\beta^* - \hat\beta^{Quad}|_1 \le 4|\beta^*_{\mathcal{A}^*} - \hat\beta^{Quad}_{\mathcal{A}^*}|_1$. Taking this change into account, we can use Assumption RE
instead of Assumption B(B), and then a reasoning similar to that of the proof of Theorem [1] leads
to the desired results. □



Proof of Proposition. Using exactly the same reasoning as in the proof of Proposition 1, but
based on Theorem 3 instead of Theorem 1, we obtain with probability at least $1 - \eta$

|/3> - /33r% < 2</.;JA„^^, (22) 

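since here r becomes equal to 1/2 in Lemma 2. This completes the proof of the first part of
the Proposition.

Next, the bound $|\hat\beta^{Quad}_{A^*} - \beta^*_{A^*}|_\infty \le U$ means that $\beta^*_j - U \le \hat\beta^{Quad}_j \le \beta^*_j + U$ for all $j \in A^*$.

Note that by assumption, we have $|\beta^*_j| > U$ for all $j \in A^*$. Then, distinguishing the case $\beta^*_j > 0$
and the case $\beta^*_j < 0$, we easily conclude that $\beta^*_j > 0$ implies $\hat\beta^{Quad}_j > 0$ and $\beta^*_j < 0$ implies
$\hat\beta^{Quad}_j < 0$. This enables us to write

$$P\big(\mathrm{Sgn}(\hat\beta^{Quad}_{A^*}) = \mathrm{Sgn}(\beta^*_{A^*})\big) \ge P\big(|\hat\beta^{Quad}_{A^*} - \beta^*_{A^*}|_\infty \le U\big) \ge 1 - \eta,$$
and this naturally implies that $A^* \subset \hat A$ with high probability. □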
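
The sign-recovery step above can be illustrated with a small Python sketch: if every non-zero coefficient of $\beta^*$ exceeds $U$ in absolute value and the estimate is within $U$ of $\beta^*$ in sup-norm, the signs agree on the true support. The vector `beta_star`, the threshold `U` and the perturbation standing in for the estimation error are hypothetical values chosen for this illustration only.

```python
import numpy as np

def signs_match_on_support(beta_star, beta_hat):
    """True if beta_hat has the same sign as beta_star wherever beta_star != 0."""
    support = beta_star != 0
    return bool(np.all(np.sign(beta_hat[support]) == np.sign(beta_star[support])))

# Hypothetical true vector; its smallest non-zero entry is 0.8 in absolute value.
beta_star = np.array([1.5, -2.0, 0.0, 0.8, 0.0])

# Hypothetical threshold U, chosen strictly below min |beta*_j| on the support.
U = 0.5 * np.min(np.abs(beta_star[beta_star != 0]))

# Stand-in for an estimator: any vector whose sup-norm error is at most U.
rng = np.random.default_rng(1)
beta_hat = beta_star + rng.uniform(-U, U, size=beta_star.size)

sup_error = np.max(np.abs(beta_hat - beta_star))
print("sup-norm error:", sup_error, "threshold U:", U)
print("signs recovered on the support:", signs_match_on_support(beta_star, beta_hat))
```

Entries outside the true support may still be non-zero in `beta_hat`, which is why this argument only gives the inclusion of the true support in the estimated one.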
Proof of Theorem 7. We now show that $\hat A \subset A^*$ with high probability. This proof is largely
inspired by the one by Bunea [5]. First of all, note that we can write the KKT conditions of
the minimization problem (6) as

$$\Big|K_n(\hat\beta^{Quad} - \beta^*) - \frac{X'\varepsilon}{n} + \mu_n J\beta^*\Big|_\infty \le \frac{\lambda_n}{2}. \qquad (23)$$
Then all the solutions of the criterion (6) share the same active set

$$\hat A = \Big\{ j \in \{1, \dots, p\} : \Big|\big(K_n(\hat\beta^{Quad} - \beta^*)\big)_j - \frac{X_j'\varepsilon}{n} + \mu_n (J\beta^*)_j\Big| = \frac{\lambda_n}{2} \Big\}.$$

That is, all these solutions have non-zero components at the same positions. We now use this
property to show that the estimator $\hat\beta^{Quad}$ has non-zero components at the same positions as a
well-controlled (but uncomputable) estimator on an event which occurs with high probability.
For this purpose, let us consider the criterion 

jeA* jeA* 

where we recall that for any $p$-dimensional vector $a$ and any set $\Theta \subset \{1, \dots, p\}$, the notation $a_\Theta$
means that $(a_\Theta)_j = a_j$ for all $j \in \Theta$ and $0$ otherwise. Moreover, $J_{A^*}$ is such that $(J_{A^*})_{j,k} = J_{j,k}$ if
$j, k \in A^*$ and $0$ otherwise. Define the estimator

$$\hat b = \operatorname*{argmin}_{b \in \mathbb{R}^p :\; b_{(A^*)^c} = 0_p} F(b),$$

where $0_p$ is the zero vector of $\mathbb{R}^p$. Since we restricted $b$ to be zero wherever $\beta^*$ is zero, and since this is
information we do not have access to, the vector $\hat b$ is not computable. Let
us denote by $\Omega$ the following event



[display defining the event illegible in source]



Observe how the event $\Omega$ is inspired by the KKT conditions (23). Actually, on the event $\Omega$,
the components $\hat b_k$ with $k \notin A^*$ equal zero, as they do not saturate the KKT conditions. This
makes the minimization of $F(b)$ over $\{b \in \mathbb{R}^p : b_{(A^*)^c} = 0_p\}$ coincide with the minimization of
the criterion (6) on $\Omega$. That is, the estimator $\hat b$ turns out to be also a solution of the original
criterion (6) on $\Omega$. But $\hat\beta^{Quad}$ is also a solution of (6) and then, as we already pointed out, this
implies that on $\Omega$ both $\hat\beta^{Quad}$ and $\hat b$ have non-zero components at the same positions, and
hence $\hat b$ has non-zero components at the positions $j \in \hat A$. Adding the fact that by construction
$\hat b_{(A^*)^c} = 0_p$, we get $\hat A \subset A^*$ on the event $\Omega$. It then remains to prove that the event $\Omega$ occurs
with high probability. We have



[display illegible in source: a chain of inequalities with sums over $k \notin A^*$ and $j \in A^*$]

where we used the fact that for real numbers $a$ and $b$, we have $|a| + |b| \ge |a + b|$ in the third
inequality and the fact that u,„ = , — in the fourth one. Let us consider the last two terms

in the last display separately, i) First, since A„, = iQq-^J' ^^P^V'^/^^'^p^^ ^ g^j-^^^ using close argu- 
ments to those employed in Lemma[2l we obtain YIh-A* ^ {j'~n~\ — ^) — ^TTp' according 

to Eha* ^ (l EjeA*iKn)jAbj - ^j)\ > we need to control | EjeA*i^n)j,k{h ' 
every k ^ A*. On one hand, Assumption D implies that 

yk^A* I Y.iKn)jAh-l3j)\ < E \h - (25) 

jeA* jeA* 

By definition of $\hat b$, we just have to repeat the proof of Theorem 3 but with $\hat b$ instead of $\hat\beta^{Quad}$ and
only on the true sparsity set A* . We get that on the event A„^_4* = |maxjg_4* \X'j£\ < 
which is the same that A„.p but using A* instead of {1, ... 

\k-l3*\<8<p-'^Xn\A*\. 

jeA* 



Moreover, similar reasoning as in Lemma [2] leads to P y^nA*j — ^TTp' Combine this result 
with (i25l) and get 



[display illegible in source: a chain of inequalities ending with the probability that $\sum_{j \in A^*} |\hat b_j - \beta^*_j|$ exceeds the bound above]



provided that t < We finally conclude by this last inequality and ([Ml) that P(^ A*) < 
?7(y^ — h < rj. Then we get the desired result. □ 
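
The KKT characterization (23) used in the proof above can be checked numerically. The following Python sketch assumes the criterion has the form $\frac{1}{n}|Y - Xb|_2^2 + \lambda_n |b|_1 + \mu_n b'Jb$ with $K_n = X'X/n + \mu_n J$. This form, the proximal-gradient solver, the simulated data and all function names are assumptions made for the illustration, not the authors' implementation. The choice $J = I_p$ corresponds to the Elastic-Net special case.

```python
import numpy as np

def soft_threshold(z, t):
    """Proximal operator of t * |.|_1."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def quad_lasso_ista(X, y, lam, mu, J, n_iter=5000):
    """Proximal gradient for (1/n)|y - Xb|^2 + lam*|b|_1 + mu*b'Jb (assumed form)."""
    n, p = X.shape
    K = X.T @ X / n + mu * J               # K_n = X'X/n + mu*J, J assumed symmetric
    c = X.T @ y / n
    L = 2.0 * np.linalg.eigvalsh(K).max()  # Lipschitz constant of the smooth gradient
    b = np.zeros(p)
    for _ in range(n_iter):
        b = soft_threshold(b - 2.0 * (K @ b - c) / L, lam / L)
    return b

# Simulated data, for illustration only.
rng = np.random.default_rng(0)
n, p = 50, 10
X = rng.standard_normal((n, p))
beta_star = np.zeros(p)
beta_star[:2] = [2.0, -1.5]
eps = 0.1 * rng.standard_normal(n)
y = X @ beta_star + eps
lam, mu, J = 0.2, 0.1, np.eye(p)

b_hat = quad_lasso_ista(X, y, lam, mu, J)

# Left-hand side of (23): K_n (b_hat - beta*) - X'eps/n + mu * J beta*.
K = X.T @ X / n + mu * J
kkt = K @ (b_hat - beta_star) - X.T @ eps / n + mu * (J @ beta_star)
active = b_hat != 0

print("max |KKT term|:", np.abs(kkt).max(), "threshold:", lam / 2)
print("active set:", np.flatnonzero(active))
```

Under these assumptions, every coordinate of the KKT term is bounded by $\lambda_n/2$ in absolute value, and the bound is attained on the coordinates where the solution is non-zero, which is the saturation property the proof relies on.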






Proof of Theorem. This proof is almost the same as the one of Theorem 1. The only
difference is the way to control the event $\Lambda_{n,p} = \{\max_{j=1,\dots,p} 2|V_j| \le r\lambda_n\}$, where $V_j$ =
SILi when the noise admits only zero mean and finite variance. Then we do not 

use the concentration inequality provided in Lemma 2 for Gaussian noise, but an analogous
concentration inequality better adapted to this type of noise. This concentration inequality is
given by Lemma 3, and we get

$$P\Big(\max_{j=1,\dots,p} 2|V_j| \le r\lambda_n\Big) \ge 1 - \eta,$$



for a value of A„ = T" y ^^"t?"^ ' '^^^^ r = 1/2 and we plug this new value of the 

tuning parameter $\lambda_n$, instead of the one used to establish the previous results, into Theorem 1.
We just finish the proof by using the fact that = '^""^^'^ ^ and we obtain the analogous of 

2| J/3*|2 

Corollary [H □ 